About
Technical executive and entrepreneur with deep expertise in engineering, distributed…
Activity
-
Last Monday was my last day at Databricks. I spent an incredible three years launching and scaling one of the fastest-growing businesses ever…
Liked by Sam Shah
-
We just announced that Weights & Biases is being acquired by CoreWeave, the AI Hyperscaler. We could not be prouder or more excited to join forces…
Liked by Sam Shah
-
Some personal news: I’m thrilled to share that I’m joining Index Ventures! After an incredible run at Databricks, where I had the privilege of…
Liked by Sam Shah
Experience
Education
-
University of Michigan
-
Specialization in distributed systems & search.
Dissertation title: "Leveraging Context for File Search and Organization"
Volunteer Experience
-
Volunteer
State of California
- 6 months
Health
I lent my skills to build, manage, and scale models of the COVID-19 epidemic under a crisis timeline for California, FEMA, other states, and international partners. An expansive private-public effort across teams in the government, academia, and Silicon Valley, our work was instrumental in better controlling the pandemic. Part of this work was featured in the Michael Lewis book “The Premonition.”
-
Volunteer
Big Brothers Big Sisters of America
- Present 4 years
Children
Mentor for the Big Brothers Big Sisters program, where I’ve had the privilege of guiding a young mind through learning and growth.
Publications
-
LinkedIn Skills: Large-Scale Topic Extraction and Inference
RecSys 2014
"Skills and Expertise" is a data-driven feature on LinkedIn, the world's largest professional online social network, which allows members to tag themselves with topics representing their areas of expertise. In this work, we present our experiences developing this large-scale topic extraction pipeline, which includes constructing a folksonomy of skills and expertise and implementing an inference and recommender system for skills. We also discuss a consequent set of applications, such as Endorsements, which allows members to tag themselves with topics representing their areas of expertise and for their connections to provide social proof, via an "endorse" action, of that member's competence in that topic.
-
Hourglass: a Library for Incremental Processing on Hadoop
IEEE BigData
Hadoop enables processing of large data sets through its relatively easy-to-use semantics. However, jobs are often written inefficiently for tasks that could be computed incrementally due to the burdensome incremental state management for the programmer. This paper introduces Hourglass, a library for developing incremental monoid computations on Hadoop. It runs on unmodified Hadoop and provides an accumulator-based interface for programmers to store and use state across successive runs; the framework ensures that only the necessary subcomputations are performed. It is successfully used at LinkedIn, one of the largest online social networks, for many use cases in dashboarding and machine learning. Hourglass is open source and freely available.
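The incremental idea in this abstract can be illustrated with a small sketch. It is not the Hourglass API and does not use Hadoop; the function names (`merge`, `incremental_run`) and the per-day partition layout are assumptions made for illustration only. The point it shows is that when the aggregation is a monoid (an associative merge with an identity), state saved from earlier runs can be combined with newly arrived partitions alone, without recomputing the old ones.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical input layout: partition name (e.g. a day) -> list of event keys.
Partitions = Dict[str, List[str]]


def merge(a: Counter, b: Counter) -> Counter:
    """Monoid operation for positive counts: associative, identity is Counter()."""
    return a + b


def incremental_run(
    state: Counter, seen: Set[str], partitions: Partitions
) -> Tuple[Counter, Set[str], List[str]]:
    """Fold only the partitions not processed in earlier runs into the saved state."""
    new_parts = sorted(set(partitions) - seen)
    for name in new_parts:
        state = merge(state, Counter(partitions[name]))
    return state, seen | set(new_parts), new_parts


# First run: nothing has been processed yet, so both partitions are folded in.
data = {"d1": ["a", "b", "a"], "d2": ["b"]}
state, seen, processed = incremental_run(Counter(), set(), data)

# Second run: only the newly arrived partition "d3" is processed.
data["d3"] = ["a", "c"]
state2, seen2, processed2 = incremental_run(state, seen, data)
```

Carrying `state` and `seen` between runs plays the role of the stored accumulator described in the abstract; a real deployment would persist both rather than keep them in memory.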
-
Root cause detection in a service-oriented architecture
SIGMETRICS 2013 - Special Interest Group on Measurement and Evaluation
Large-scale websites are predominantly built as a service-oriented architecture. Here, services are specialized for a certain task, run on multiple machines, and communicate with each other to serve a user’s request. An anomalous change in a metric of one service can propagate to other services during this communication, resulting in overall degradation of the request. As any such degradation is revenue impacting, maintaining correct functionality is of paramount concern: it is important to find the root cause of any anomaly as quickly as possible. This is challenging because there are numerous metrics or sensors for a given service, and a modern website is usually composed of hundreds of services running on thousands of machines in multiple data centers.
This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked order list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, show a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.
-
Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph
5th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '13)
Social networks often require the ability to perform low latency graph computations in the user request path. For example, at LinkedIn, we show the graph distance and common connections when we show a profile in any context on the site. To do this, we have developed a distributed and partitioned graph system that scales to hundreds of millions of members and their connections, handling hundreds of thousands of queries per second.
To accomplish this scaling, real time distributed graph traversal is converted into set intersections that are accomplished in a scatter/gather manner. A network performance bottleneck forms on the gather node as it must merge partial results from many machines. In this paper, we present a modified greedy set cover algorithm that is used to locate the minimal set of machines that can serve the partial results. Our results indicate that we are able to save 25% in the 99th percentile latency of these graph distance calculations for LinkedIn’s social graph workloads.
-
Organizational Overlap on Social Networks and its Applications
WWW 2013 - 22nd International World Wide Web Conference
Online social networks have become important for networking, communication, sharing, and discovery. A considerable challenge these networks face is the fact that an online social network is partially observed because two individuals might know each other, but may not have established a connection on the site. Therefore, link prediction and recommendations are important tasks for any online social network. In this paper, we address the problem of computing edge affinity between two users on a social network, based on the users belonging to organizations such as companies, schools, and online groups. We present experimental insights from social network data on organizational overlap, a novel mathematical model to compute the probability of connection between two people based on organizational overlap, and experimental validation of this model based on real social network data. We also present novel ways in which the organization overlap model can be applied to link prediction and community detection, which in itself could be useful for recommending entities to follow and generating personalized news feed.
-
The "Big Data" ecosystem at LinkedIn
SIGMOD 2013 - Special Interest Group on Management Of Data
The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn’s Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the “last mile” issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.
-
Metaphor: a system for related search recommendations
In Proceedings of the 21st International Conference on Information and Knowledge Management (CIKM 2012)
Search plays an important role in online social networks as it provides an essential mechanism for discovering members and content on the network. Related search recommendation is one of several mechanisms used for improving members' search experience in finding relevant results to their queries. This paper describes the design, implementation, and deployment of Metaphor, the related search recommendation system on LinkedIn, a professional social networking site with over 174 million members worldwide. Metaphor builds on a number of signals and filters that capture several dimensions of relatedness across member search activity. The system, which has been in production for over a year, has gone through multiple iterations and evaluation cycles. This paper makes three contributions. First, we provide a discussion of a large-scale related search recommendation system. Second, we describe a mechanism for effectively combining several signals in building a unified dataset for related search recommendations. Third, we introduce a query length model for capturing bias in recommendation click behavior. We also discuss some of the practical concerns in deploying related search recommendations.
-
Avatara: OLAP for Web-scale Analytics Products
VLDB 2012 - International Conference on Very Large Databases
Multidimensional data generated by members on websites has seen massive growth in recent years. OLAP is a well-suited solution for mining and analyzing this data. Providing insights derived from this analysis has become crucial for these websites to give members greater value. For example, LinkedIn, the largest professional social network, provides its professional members rich analytics features like "Who's Viewed My Profile?" and "Who's Viewed This Job?" The data behind these features form cubes that must be efficiently served at scale, and can be neatly sharded to do so. To serve our growing 160 million member base, we built a scalable and fast OLAP serving system called Avatara to solve this many, small cubes problem. At LinkedIn, Avatara has been powering several analytics features on the site for the past two years.
-
Social Networking in Developing Regions
International conference on Information and Communication Technologies and Development (ICTD 2012)
Online social networks have enjoyed significant growth over the past several years. With improvements in mobile and Internet penetration, developing countries are participating in increasing numbers in online communities. This paper provides the first large scale and detailed analysis of social networking usage in developing country contexts. The analysis is based on data from LinkedIn, a professional social network with over 120 million members worldwide. LinkedIn has members from every country in the world, including millions in Africa, Asia, and South America. The goal of this paper is to provide researchers a detailed look at the growth, adoption, and other characteristics of social networking usage in developing countries compared to the developed world. To this end, we discuss several themes that illustrate different dimensions of social networking use, ranging from interconnectedness of members in geographic regions to the impact of local languages on social network participation.
-
Serving Large-scale Batch Computed Data with Project Voldemort
In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST 2012)
Project Voldemort is a general purpose distributed storage and serving system inspired by Amazon's Dynamo. We present a novel pipeline for computing, deploying and serving massive read-only data sets that we have integrated into Voldemort. This pipeline builds on the inherent fault-tolerance and horizontal scalability of the Dynamo architecture to solve a common problem: performing massive data loads into an online system without impacting serving performance. The data generation is done offline using Hadoop, and our system effectively bridges the gap between batch-oriented clusters and real-time serving systems. As a production system at LinkedIn, this has helped us rapidly build out various data-intensive social products that are computed offline, and then publish the multi-TB result data to the live production throughout the day.
More activity by Sam
-
This is a great role in my team. Come work on the cutting edge of use cases using GenAI and ML for Enterprise. Make magic with the best team.
Liked by Sam Shah
-
My team is hiring for multiple roles focused on enhancing the quality of LLM-powered products, including Databricks Assistant and AI/BI Genie. If…
Liked by Sam Shah
-
I recently started a new job at Eightfold AI as their VP of AI, reporting to the brilliant co-founder/CTO Varun Kacholia. Previously, Varun ran…
Liked by Sam Shah
-
Today marks the conclusion of a rewarding 14-year journey at LinkedIn. It has been an incredible ride, and I am deeply grateful for the…
Liked by Sam Shah
-
After a transformative four and a half years, I have decided to move on from Pinterest. It's been an incredible journey, from shaping the creator…
Liked by Sam Shah
-
Going cloud native? Discover what to prioritize in your observability platform to effectively manage costs and data.
Liked by Sam Shah