Among machine learning paradigms, Reinforcement Learning (RL) is less widely known than its more famous peers, supervised and unsupervised learning, primarily because of its complexity. Unlike its peers, in RL the model (the agent) learns through its own experience: the agent must determine its actions through sequential decision-making to maximize the reward it receives. The foundation of RL is the framework of a Markov Decision Process (MDP), whose key elements are:
- The agent is an autonomous system that learns to behave in an unknown environment, like a toddler learning to walk.
- The environment is the physical or virtual world that the agent interacts with.
- The reward is the feedback the agent receives for taking an action; for a toddler it could be appreciation from the parents, and for an agent learning a new game it could be the score earned. Depending on the action, the reward can also be negative.
- The policy maps states to actions, i.e., the policy drives the agent to take a particular action in a given state.
The objective of reinforcement learning is to find the optimal policy, i.e., the policy that maximizes the expected cumulative reward. Over many episodes, the agent learns through trial and error which actions maximize this expected reward.
Here is the pseudo-code of a generic RL training loop:

```python
def reinforcement_learning(env, policy, num_episodes):
    for episode in range(num_episodes):
        state = env.reset()              # start a new episode
        done = False
        while not done:
            action = policy(state)                        # act according to the current policy
            next_state, reward, done = env.step(action)   # observe the environment's response
            policy.update(state, action, reward, next_state, done)  # learn from the transition
            state = next_state
    return policy
```
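As a minimal illustration, the same loop can be run against a Gym-style environment. The snippet below assumes the Gymnasium package and uses a random policy purely for demonstration; a learning agent would replace the random action with its own policy and update it from each transition.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()                 # stand-in for policy(state)
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated                     # episode ends on either signal
    total_reward += reward
print(f"Episode return: {total_reward}")
```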
From this description, one thing becomes clear: by design, RL is a potent technique for handling very complex real-world scenarios. Its applications have traditionally been most common in gaming and robotics, where labeled data is often impossible to obtain. However, RL is not restricted to these domains; at this point, it is safe to say that its most famous applications are AI chatbots such as OpenAI's ChatGPT and Google's Gemini. Both are built on large language models (LLMs), and Reinforcement Learning from Human Feedback (RLHF) has been used to improve their performance.
However, apart from these few instances, we still have not witnessed widespread adoption of RL in industry, even though it has proven its worth on multiple occasions. This article therefore explores a few domains, such as supply chain management and digital marketing, where RL can outperform conventional methods.
In the application examples that follow, two types of RL methods will come up repeatedly: Deep Q-learning and policy-gradient methods. More broadly, RL methods fall into two main categories: model-based and model-free.
- In model-based reinforcement learning, the agent learns a model of the environment, which it can then use to predict future states and rewards and thereby improve its policy.
- Model-free reinforcement learning methods do not learn a model of the environment. Instead, they try to learn the optimal policy directly from the rewards and transitions observed after taking actions. Most real-world applications fall into this category.
- Deep Q-learning is a value-based, model-free RL method that uses deep learning to learn a value function. The value of a state is estimated from the expected rewards the agent can receive from that state onwards. In Q-learning, the agent learns the optimal state-action values and thereby obtains the optimal policy indirectly (a minimal sketch of the underlying update rule is shown after this list).
- Policy-gradient methods such as Deep Deterministic Policy Gradient (DDPG) and proximal policy optimization (PPO), by contrast, are policy-based, model-free methods that use deep learning to learn the policy directly.
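To make the value-based idea concrete, here is a minimal sketch of the tabular Q-learning update that Deep Q-learning generalizes with a neural network. The environment size, learning rate, and discount factor are illustrative assumptions, not values from any of the studies cited below.

```python
import numpy as np

n_states, n_actions = 10, 4            # assumed small, discrete problem
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))    # DQN replaces this table with a neural network

def q_update(state, action, reward, next_state, done):
    """One temporal-difference update toward the Bellman target."""
    target = reward if done else reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

def act(state):
    """Epsilon-greedy action selection from the current value estimates."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[state]))           # exploit
```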
With this introduction in place, the following sections focus on recent developments in the applications of reinforcement learning.
Supply Chain Management
One sector that can truly benefit from RL is the consumer packaged goods (CPG) industry. CPG companies struggle to optimize their multi-echelon supply chains, balancing inventory across multiple distribution points between production and the consumer to reduce costs and improve customer satisfaction.
In recent years, multiple studies have shown that RL can manage the supply chain far more successfully than traditional inventory control policies. A well-known example is the beer game, which illustrates how small fluctuations in demand at the retail level cause drastic fluctuations in demand at the wholesaler, distributor, manufacturer, and eventually raw-material-supplier levels. The beer game is a serial supply chain of four agents, a retailer, a warehouse, a distributor, and a manufacturer, each of whom must make independent replenishment decisions with limited information. Oroojlooyjadid et al. showed that when the other participants follow a base-stock policy, a DQN agent can learn near-optimal order quantities [1]. In their setup, the DQN agent runs a Q-learning algorithm to minimize the total cost of the game: at each step it decides the order quantity given its inventory level, observed demand, and so on. Because the total cost is only revealed at the end of the game, the per-step rewards are assigned through a feedback scheme. A minimal sketch of the per-period cost an agent in such a game tries to minimize is shown below.
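The sketch below shows the standard per-period cost in beer-game-style inventory models: a holding cost on positive inventory plus a shortage cost on backorders. The cost coefficients and the use of the negated cost as the RL reward are assumptions for illustration, not parameters from the cited study.

```python
def step_cost(inventory_level, holding_cost=0.5, shortage_cost=1.0):
    """Per-period cost for one echelon of a beer-game-style supply chain.

    A positive inventory_level means units sitting on the shelf (holding cost);
    a negative level means unmet demand that is backordered (shortage cost).
    """
    holding = holding_cost * max(inventory_level, 0)
    shortage = shortage_cost * max(-inventory_level, 0)
    return holding + shortage

# An RL agent typically receives the negated cost as its reward,
# so minimizing total cost becomes maximizing cumulative reward.
reward = -step_cost(inventory_level=-3)   # e.g., 3 backordered units
```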
In a different study, Perez et al. demonstrated that, using the proximal policy optimization (PPO) method, RL can learn an effective and efficient policy that adapts to network disruptions [2]. This case study involves one month's inventory management for a multi-echelon, make-to-order supply network of a single product with fluctuating demand, where the optimization aims to maximize the time-averaged expected profit of the network.
In a similar study, Xie showed that a PPO-based RL algorithm can make far better ordering decisions for a multi-echelon network (consisting of two distribution centers and four retailers), earning considerably higher profit than a traditional (s, S) policy [3]. Stranieri et al. reported a similar observation, namely that a policy-based deep reinforcement learning (DRL) method outperforms a static (s, Q) policy, and they also released an open-source library for the inventory management problem [4]. For readers unfamiliar with these classical baselines, a minimal sketch of the (s, S) and (s, Q) reorder rules is shown below.
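The (s, S) and (s, Q) policies that these studies use as baselines are simple threshold rules. The sketch below illustrates them with arbitrary example thresholds; the specific numbers are not taken from the cited papers.

```python
def s_S_policy(inventory_position, s=20, S=100):
    """Classical (s, S) rule: when inventory falls to or below the reorder
    point s, order enough to bring it back up to the order-up-to level S."""
    return S - inventory_position if inventory_position <= s else 0

def s_Q_policy(inventory_position, s=20, Q=50):
    """Classical (s, Q) rule: when inventory falls to or below s,
    order a fixed quantity Q."""
    return Q if inventory_position <= s else 0
```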
Digital Marketing
The next domain we want to discuss is digital marketing, where the recommender system (RS) has become almost synonymous with the field. Recommender systems help users sort through huge collections of products by suggesting items they might be interested in. Traditional RS approaches rely on collaborative filtering, content-based filtering, and hybrid techniques, but these still struggle with the cold-start problem and tend to reinforce existing biases in the data. Deep learning models address some of these issues but remain computationally expensive. Deep reinforcement learning models can be very effective here because they can handle large state and action spaces, and big tech companies like Google and Netflix already use RL in their recommender systems.

Google researchers have shown that a policy-gradient-based top-K recommender system can run in a live production system on YouTube, where the action space is on the order of millions of videos [5]. The study aimed to improve user satisfaction metrics such as clicks and watch time. The state is the sequence of a user's historical interactions with the system, including feedback such as clicks and watch time on recommended videos; the action is recommending the next set of videos; and the reward is measured by the resulting clicks and watch time. The critical contributions of this work were an off-policy correction to handle data biases and a novel top-K off-policy correction adapted to recommending K items at a time, sketched below.
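The core idea of the off-policy correction is to reweight each logged interaction by how likely the current policy is to pick that item relative to the logging policy, with an extra factor accounting for the fact that K items are recommended at once. The sketch below is a simplified rendering of that weighting; the per-item probabilities and the clipping value are placeholders, not the production implementation described in the paper.

```python
def off_policy_weight(pi_prob, beta_prob, k=16, cap=10.0):
    """Weight for one logged (state, action, reward) example.

    pi_prob:   probability the current policy assigns to the logged item
    beta_prob: probability the logging (behavior) policy assigned to it
    k:         number of items shown at once (top-K recommendation)
    cap:       clip value to keep the variance of the estimator in check
    """
    importance = pi_prob / max(beta_prob, 1e-8)       # standard off-policy correction
    top_k_factor = k * (1.0 - pi_prob) ** (k - 1)     # top-K correction multiplier
    return min(importance * top_k_factor, cap)

# The REINFORCE gradient contribution of this example is then
#   off_policy_weight(...) * reward * grad(log pi_prob)
```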
In a more recent study, Netflix constructed an algorithm in which RL balances a recommendation's relevance against the time the user spends evaluating it, in order to increase engagement within a fixed time budget [6]. They note that this problem is connected to the 0/1 knapsack problem from theoretical computer science: in both cases, the objective is to select the subset of items that maximizes value while the total cost stays within the user's budget.
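For intuition, here is a minimal 0/1 knapsack dynamic program over item relevance (value) and evaluation time (cost). The example items are invented for illustration and do not come from the Netflix study.

```python
def knapsack(values, costs, budget):
    """Maximize total value of selected items with total cost <= budget (0/1 knapsack)."""
    best = [0.0] * (budget + 1)
    for value, cost in zip(values, costs):
        # iterate budgets downward so each item is used at most once
        for b in range(budget, cost - 1, -1):
            best[b] = max(best[b], best[b - cost] + value)
    return best[budget]

# Hypothetical example: relevance scores, evaluation times (seconds), 60-second budget
print(knapsack(values=[0.9, 0.7, 0.4, 0.8], costs=[30, 20, 10, 40], budget=60))
```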
In addition to product recommendations, digital marketing can benefit from RL in various other ways, such as dynamic pricing, advertising budget optimization, and customer lifetime value estimation [7]. RL can even select the best content for online advertisements or email marketing. In a recent study, Singh et al. showed that RL can be more successful at selecting ad content than traditional A/B testing: using the Upper Confidence Bound (UCB) algorithm to choose among six advertisements shown to customers in rotation during a digital campaign for a startup named yourfrstad.com, the RL model generated a higher click-through rate (CTR) than A/B testing [8]. A minimal UCB sketch is shown below.
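The UCB idea is to show the ad whose observed CTR plus an uncertainty bonus is highest, so under-explored ads keep getting a chance. Below is a minimal UCB1 sketch with simulated click probabilities; the numbers are illustrative assumptions, not data from the cited campaign.

```python
import math
import random

true_ctr = [0.02, 0.05, 0.03, 0.04, 0.01, 0.06]   # hypothetical CTRs of six ads
clicks = [0] * 6
shows = [0] * 6

for t in range(1, 10_000):
    # UCB1 score: empirical CTR + exploration bonus (show each ad at least once first)
    scores = [
        (clicks[i] / shows[i]) + math.sqrt(2 * math.log(t) / shows[i])
        if shows[i] > 0 else float("inf")
        for i in range(6)
    ]
    ad = scores.index(max(scores))
    shows[ad] += 1
    clicks[ad] += random.random() < true_ctr[ad]   # simulated user click

print("Impressions per ad:", shows)   # the best ad should dominate over time
```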
Banking, Financial Services and Insurance (BFSI)
We now turn to the banking system. It has been reported that the total transaction value in digital payments may reach a whopping US $2,041 bn in 2023. Along with this growth, the number of digital fraud attacks has increased dramatically, and fraud is only getting more sophisticated; personal information is often stolen and used for unauthorized transactions. Traditional fraud detection systems use supervised machine-learning algorithms to maximize the fraud recall rate, but the current situation demands a more robust solution that can adapt as new fraud patterns constantly emerge.
In a 2021 study, Vimal et al. compared a DQN agent against multiple supervised algorithms for fraud detection on a publicly available fraud dataset and observed that the DQN method could beat every algorithm except XGBoost, a gap the authors expect can be closed with a better strategy [9]. In this study, the state is the transaction, and the action space is discrete: approve or decline the transaction. The agent receives two rewards. The first is a monetary reward defined by the bank's revenue model on credit/debit cards: the bank earns an interchange fee from the merchant for every legitimate transaction it approves and loses the full amount when it approves a fraudulent one. The second reward is derived from the balance between the system's fraud rate and its decline rate. A simplified sketch of the monetary reward follows.
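The monetary part of that reward scheme can be sketched as below; the fee rate and the sign conventions are illustrative assumptions, not values from the paper.

```python
def monetary_reward(action, amount, is_fraud, interchange_rate=0.02):
    """Reward for one transaction decision under a simplified bank revenue model.

    action:   "approve" or "decline"
    amount:   transaction amount
    is_fraud: ground-truth label known in the training data
    """
    if action == "approve":
        # fee earned on a legitimate transaction, full amount lost on a fraudulent one
        return interchange_rate * amount if not is_fraud else -amount
    # declining earns nothing; the cost of declining legitimate customers is
    # captured separately through the fraud-rate / decline-rate balance term
    return 0.0
```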
In a more recent study, Tekkali and Natarajan showed that combining a rough-set-theory-based feature extraction step with a DQN can achieve almost 96% accuracy on a highly imbalanced credit card fraud dataset [10]. Along with improved accuracy, this approach also reduced processing time, which has long been a concern with traditional ML methods.
Healthcare
Designing a new drug is costly and can take decades. Recently, deep learning has become quite prevalent in predicting drug-target interactions. RL can accelerate the process even further by designing novel drugs, commonly known as de novo drug design, and thereby help reduce costs for pharmaceutical companies.
Zhang et al. developed a method based on the REINVENT algorithm to design drug candidate molecules with the desired properties, given only the basic structure of the target protein (its amino acid sequence) [11]. They used policy-based RL to fine-tune RNN-based generative models so that they produce molecules with the desired properties, with a drug-target affinity model supplying the reward function. A simplified sketch of this kind of reward-weighted fine-tuning is shown below.
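The general pattern of such fine-tuning is REINFORCE-style: sample molecules from the generator, score them with the affinity model, and push up the log-likelihood of high-reward samples. The sketch below is a heavily simplified rendering of that loop; the linear "generator", `sample_molecules`, and `affinity_score` are hypothetical stand-ins, not the actual REINVENT implementation.

```python
import torch

# Placeholder generator: any autoregressive model exposing log-probabilities of
# sampled molecules; a linear layer stands in for the real RNN here.
generator = torch.nn.Linear(16, 32)
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)

def sample_molecules(batch_size):
    """Placeholder: return sampled molecule tokens and the log-prob of each sample."""
    logits = generator(torch.randn(batch_size, 16))
    dist = torch.distributions.Categorical(logits=logits)
    tokens = dist.sample()
    return tokens, dist.log_prob(tokens)

def affinity_score(tokens):
    """Placeholder for the drug-target affinity model that provides the reward."""
    return torch.rand(tokens.shape[0])

for step in range(100):
    tokens, log_probs = sample_molecules(batch_size=64)
    rewards = affinity_score(tokens)                 # higher = better predicted binding
    loss = -(rewards.detach() * log_probs).mean()    # REINFORCE: raise log-prob of good molecules
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```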
Beyond drug design, healthcare can benefit immensely from RL. Medical practitioners already follow an approach similar to reinforcement learning when treating patients: they go through a sequential decision-making process of recommending a treatment after assessing the patient's condition, observing the outcome, and repeating as needed. Fatemi et al. investigated whether offline RL (where the model is trained only on previously collected data) can identify the treatments to avoid for high-risk patients, i.e., treatments from which no recovery is possible [12]. They termed this process Dead-end Discovery (DeD). Using publicly available ICU patient data, they showed that a DQN-based RL model could raise flags for fatal cases well ahead of death, allowing the model to warn doctors about the outstanding risks of a course of treatment.
Natural Language Processing
As countless academic papers and blog posts have noted, custom chatbots similar to the popular AI chatbots mentioned in the introduction can also be tailored for specific industries or companies. Open-source LLMs such as GPT-NeoX, OpenAssistant, Alpaca or Vicuna (built on LLaMA), and Grok (from xAI) can be very advantageous here: depending on the model, the developer can access the architecture, core code, and model parameters, and quickly identify and rectify the source model's biases and issues. RLHF plays a significant role in turning these models into custom chatbots that generate human-like responses. Because pre-trained LLMs are trained on massive amounts of text scraped from the internet, they need further fine-tuning with higher-quality data; in RLHF, human annotators rank candidate responses, a reward model is trained on those rankings, and the LLM is then fine-tuned with RL against that reward model. A schematic sketch of this loop is shown below.
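The sketch below is a schematic of that RLHF loop; `generate`, `reward_model`, and `rl_update` are hypothetical placeholders standing in for whatever framework is actually used (for instance a PPO trainer), not a specific library's API.

```python
# Schematic RLHF fine-tuning loop with hypothetical stand-in components.

def generate(policy_llm, prompt):
    """Placeholder: the policy LLM produces a candidate response."""
    return f"response to: {prompt}"

def reward_model(prompt, response):
    """Placeholder: a model trained on human preference rankings scores the response."""
    return float(len(response) % 5)   # dummy score for illustration only

def rl_update(policy_llm, prompt, response, reward):
    """Placeholder: one policy-gradient (e.g., PPO-style) update of the LLM."""
    return policy_llm

policy_llm = object()                  # stand-in for the pre-trained LLM
prompts = ["How do I reset my router?", "Summarize my last invoice."]

for epoch in range(3):
    for prompt in prompts:
        response = generate(policy_llm, prompt)        # 1. sample a response
        reward = reward_model(prompt, response)        # 2. score it with the reward model
        policy_llm = rl_update(policy_llm, prompt, response, reward)  # 3. update the policy
```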
Conclusion
In the end, it can be said that reinforcement learning, because it learns from experience, represents the future of artificial intelligence. It is already being used to solve some of the most complex problems across domains, and its impact will only grow in the coming years. If you are looking for a data science technique that can make a real difference, reinforcement learning is the one to watch.
References:
1. Afshin Oroojlooyjadid, Mohammad Reza Nazari, Lawrence V. Snyder, and Martin Takac. "A Deep Q-Network for the Beer Game: Deep Reinforcement Learning for Inventory Optimization."
2. Hector D. Perez, Christian D. Hubbs, C. Li, and Ignacio E. Grossmann. "Algorithmic Approaches to Inventory Management Optimization." Processes 9 (1): 1–17, 2021.
3. Guangrui Xie. "Reinforcement Learning for Inventory Optimization Series II: An RL Model for A Multi-Echelon Network." Towards Data Science, 2022.
4. Francesco Stranieri and Fabio Stella. "A Deep Reinforcement Learning Approach to Supply Chain Inventory Management." IFIP International Conference on Artificial Intelligence Applications and Innovations, pages 282–291, 2022.
5. Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. "Top-K Off-Policy Correction for a REINFORCE Recommender System." In WSDM '19, pages 456–464, 2019.
6. Ehtsham Elahi. "Reinforcement Learning for Budget Constrained Recommendations." Netflix Technology Blog, 2022.
7. M. Mehdi Afsar, Trafford Crump, and Behrouz Far. "Reinforcement Learning Based Recommender Systems: A Survey." ACM Computing Surveys, Volume 55, Issue 7, Article 145, pages 1–38, 2022.
8. Vinay Singh, Brijesh Nanavati, and Arpan K. Kar. "How to Maximize Clicks for Display Advertisement in Digital Marketing? A Reinforcement Learning Approach." Information Systems Frontiers, 2022.
9. S. Vimal, K. Kayathwal, H. Wadhwa, and G. Dhama. "Application of Deep Reinforcement Learning to Payment Fraud." arXiv preprint arXiv:2112.04236, 2021.
10. Chandana G. Tekkali and Karthika Natarajan. "RDQN: Ensemble of Deep Neural Network with Reinforcement Learning in Classification Based on Rough Set Theory for Digital Transactional Fraud Detection." Complex & Intelligent Systems, Springer, 2023.
11. Y. Zhang, S. Li, M. Xing, Q. Yuan, H. He, and S. Sun. "Universal Approach to De Novo Drug Design for Target Proteins Using Deep Reinforcement Learning." ACS Omega, ACS Publications, 2023.
12. M. Fatemi, T. W. Killian, J. Subramanian, and M. Ghassemi. "Medical Dead-ends and Learning to Identify High-risk States and Treatments." Advances in Neural Information Processing Systems, 2021.
Nairhita Samanta
Manager, Data Science, Tredence Inc.