Sam Shah

San Francisco, California, United States
9K followers 500+ connections

About

Technical executive and entrepreneur with deep expertise in engineering, distributed…

Experience

  • Databricks

    San Francisco, California, United States

  • -

    San Francisco, California, United States

  • -

    San Francisco, California, United States

  • -

    San Francisco, California, United States

  • -

    Palo Alto, California, United States

  • -

    Mountain View, California, United States

Education

  • University of Michigan

    Specialization in distributed systems & search.
    Dissertation title: "Leveraging Context for File Search and Organization"

Volunteer Experience

  • Volunteer

    State of California

    6 months

    Health

    I lent my skills to build, manage, and scale models of the COVID-19 epidemic under a crisis timeline for California, FEMA, other states, and international partners. An expansive private-public effort across teams in the government, academia, and Silicon Valley, our work was instrumental in better controlling the pandemic. Part of this work was featured in the Michael Lewis book “The Premonition.”

  • Volunteer

    Big Brothers Big Sisters of America

    Present (4 years)

    Children

    Mentor for the Big Brother Big Sister program, where I’ve had the privilege of guiding a young mind through learning and growth.

Publications

  • LinkedIn Skills: Large-Scale Topic Extraction and Inference

    RecSys 2014

    "Skills and Expertise" is a data-driven feature on LinkedIn, the world's largest professional online social network, which allows members to tag themselves with topics representing their areas of expertise. In this work, we present our experiences developing this large-scale topic extraction pipeline, which includes constructing a folksonomy of skills and expertise and implementing an inference and recommender system for skills. We also discuss a consequent set of applications, such as Endorsements, which allows members to tag themselves with topics representing their areas of expertise and for their connections to provide social proof, via an "endorse" action, of that member's competence in that topic.

    Other authors
  • Hourglass: a Library for Incremental Processing on Hadoop

    IEEE BigData

    Hadoop enables processing of large data sets through its relatively easy-to-use semantics. However, jobs are often written inefficiently for tasks that could be computed incrementally due to the burdensome incremental state management for the programmer. This paper introduces Hourglass, a library for developing incremental monoid computations on Hadoop. It runs on unmodified Hadoop and provides an accumulator-based interface for programmers to store and use state across successive runs; the framework ensures that only the necessary subcomputations are performed. It is successfully used at LinkedIn, one of the largest online social networks, for many use cases in dashboarding and machine learning. Hourglass is open source and freely available.

    Other authors
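
The incremental idea in the Hourglass abstract above can be illustrated with a small sketch. This is not the Hourglass API (Hourglass runs on Hadoop); the function names `merge` and `incremental_update`, and the per-day event layout, are hypothetical and chosen only for illustration. It assumes the aggregate is a count per key, which forms a monoid: merging is associative and an empty counter is the identity, so saved state can be extended with new days without recomputing old ones.

```python
from collections import Counter


def merge(a, b):
    """Monoid operation (assumed): combine two partial counts.

    Addition of counters is associative, and Counter() is the identity.
    """
    return a + b


def incremental_update(state, processed_days, daily_events):
    """Fold only the days not yet processed into the saved state.

    state          -- Counter holding the aggregate from earlier runs
    processed_days -- set of day labels already folded into `state`
    daily_events   -- dict mapping day label -> list of event keys
    Returns the new state and the updated set of processed days.
    """
    new_state = Counter(state)
    seen = set(processed_days)
    for day in sorted(daily_events):
        if day in seen:
            continue  # already accounted for in the saved state
        new_state = merge(new_state, Counter(daily_events[day]))
        seen.add(day)
    return new_state, seen
```

A first run over one day produces a state; a later run that is handed both days only does work for the new one and reuses the saved counts for the rest.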
  • Root cause detection in a service-oriented architecture

    SIGMETRICS 2013 - Special Interest Group on Measurement and Evaluation

    Large-scale websites are predominantly built as a service-oriented architecture. Here, services are specialized for a certain task, run on multiple machines, and communicate with each other to serve a user’s request. An anomalous change in a metric of one service can propagate to other services during this communication, resulting in overall degradation of the request. As any such degradation is revenue impacting, maintaining correct functionality is of paramount concern: it is important to find the root cause of any anomaly as quickly as possible. This is challenging because there are numerous metrics or sensors for a given service, and a modern website is usually composed of hundreds of services running on thousands of machines in multiple data centers.

    This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked order list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, show a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.

    Other authors
  • Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph

    5th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '13)

    Social networks often require the ability to perform low latency graph computations in the user request path. For example, at LinkedIn, we show the graph distance and common connections when we show a profile in any context on the site. To do this, we have developed a distributed and partitioned graph system that scales to hundreds of millions of members and their connections, handling hundreds of thousands of queries per second.

    To accomplish this scaling, real time distributed graph traversal is converted into set intersections that are accomplished in a scatter/gather manner. A network performance bottleneck forms on the gather node as it must merge partial results from many machines. In this paper, we present a modified greedy set cover algorithm that is used to locate the minimal set of machines that can serve the partial results. Our results indicate that we are able to save 25% in the 99th percentile latency of these graph distance calculations for LinkedIn’s social graph workloads.

    Other authors
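
The set cover step described in the HotCloud '13 abstract above can be sketched with the textbook greedy heuristic. This is not the paper's modified algorithm, whose details the abstract does not give; the function name `greedy_machine_cover` and the data layout (a mapping from machine name to the set of partitions it hosts) are assumptions made for illustration. Greedy selection is an approximation, so the result is small but not guaranteed to be the true minimum.

```python
def greedy_machine_cover(needed, hosted):
    """Pick a small set of machines that together host every needed partition.

    needed -- iterable of partition ids the query must touch
    hosted -- dict mapping machine name -> set of partition ids it can serve
    Returns the chosen machine names in the order they were picked.
    """
    uncovered = set(needed)
    chosen = []
    while uncovered:
        # Take the machine covering the most still-uncovered partitions;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(hosted), key=lambda m: len(hosted[m] & uncovered))
        gain = hosted[best] & uncovered
        if not gain:
            raise ValueError("some partitions are not hosted by any machine")
        chosen.append(best)
        uncovered -= gain
    return chosen
```

Contacting fewer machines means the gather node has fewer partial results to merge, which is the bottleneck the abstract describes.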
  • Organizational Overlap on Social Networks and its Applications

    WWW 2013 - 22nd International World Wide Web Conference

    Online social networks have become important for networking, communication, sharing, and discovery. A considerable challenge these networks face is the fact that an online social network is partially observed because two individuals might know each other, but may not have established a connection on the site. Therefore, link prediction and recommendations are important tasks for any online social network. In this paper, we address the problem of computing edge affinity between two users on a social network, based on the users belonging to organizations such as companies, schools, and online groups. We present experimental insights from social network data on organizational overlap, a novel mathematical model to compute the probability of connection between two people based on organizational overlap, and experimental validation of this model based on real social network data. We also present novel ways in which the organization overlap model can be applied to link prediction and community detection, which in itself could be useful for recommending entities to follow and generating personalized news feed.

    Other authors
  • The "Big Data" ecosystem at LinkedIn

    SIGMOD 2013 - Special Interest Group on Management Of Data

    The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn’s Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the “last mile” issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.

    Other authors
  • Metaphor: a system for related search recommendations

    In Proceedings of the 21st International Conference on Information and Knowledge Management (CIKM 2012)

    Search plays an important role in online social networks as it provides an essential mechanism for discovering members and content on the network. Related search recommendation is one of several mechanisms used for improving members' search experience in finding relevant results to their queries. This paper describes the design, implementation, and deployment of Metaphor, the related search recommendation system on LinkedIn, a professional social networking site with over 174 million members worldwide. Metaphor builds on a number of signals and filters that capture several dimensions of relatedness across member search activity. The system, which has been in production for over a year, has gone through multiple iterations and evaluation cycles. This paper makes three contributions. First, we provide a discussion of a large-scale related search recommendation system. Second, we describe a mechanism for effectively combining several signals in building a unified dataset for related search recommendations. Third, we introduce a query length model for capturing bias in recommendation click behavior. We also discuss some of the practical concerns in deploying related search recommendations.

    Other authors
  • Avatara: OLAP for Web-scale Analytics Products

    VLDB 2012 - International Conference on Very Large Databases

    Multidimensional data generated by members on websites has seen massive growth in recent years. OLAP is a well-suited solution for mining and analyzing this data. Providing insights derived from this analysis has become crucial for these websites to give members greater value. For example, LinkedIn, the largest professional social network, provides its professional members rich analytics features like "Who's Viewed My Profile?" and "Who's Viewed This Job?" The data behind these features form cubes that must be efficiently served at scale, and can be neatly sharded to do so. To serve our growing 160 million member base, we built a scalable and fast OLAP serving system called Avatara to solve this many, small cubes problem. At LinkedIn, Avatara has been powering several analytics features on the site for the past two years.

    Other authors
  • Social Networking in Developing Regions

    International conference on Information and Communication Technologies and Development (ICTD 2012)

    Online social networks have enjoyed significant growth over the past several years. With improvements in mobile and Internet penetration, developing countries are participating in increasing numbers in online communities. This paper provides the first large scale and detailed analysis of social networking usage in developing country contexts. The analysis is based on data from LinkedIn, a professional social network with over 120 million members worldwide. LinkedIn has members from every country in the world, including millions in Africa, Asia, and South America. The goal of this paper is to provide researchers a detailed look at the growth, adoption, and other characteristics of social networking usage in developing countries compared to the developed world. To this end, we discuss several themes that illustrate different dimensions of social networking use, ranging from interconnectedness of members in geographic regions to the impact of local languages on social network participation.

    Other authors
  • Serving Large-scale Batch Computed Data with Project Voldemort

    In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST 2012)

    Project Voldemort is a general purpose distributed storage and serving system inspired by Amazon's Dynamo. We present a novel pipeline for computing, deploying and serving massive read-only data sets that we have integrated into Voldemort. This pipeline builds on the inherent fault-tolerance and horizontal scalability of the Dynamo architecture to solve a common problem: performing massive data loads into an online system without impacting serving performance. The data generation is done offline using Hadoop, and our system effectively bridges the gap between batch-oriented clusters and real-time serving systems. As a production system at LinkedIn, this has helped us rapidly build out various data-intensive social products that are computed offline, and then publish the multi-TB result data to the live production throughout the day.

    Other authors