Data & AI Report – Data Science Trends July 2024

August 5, 2024

1510

Trends in data science have brought a fresh wave of excitement to the data and analytics landscape this July. We’re seeing major moves towards scalability, efficient governance, and AI capabilities. Additionally, Dr. Randy Olson shows us just how far creative data use can take you—literally! Turns out, data science isn’t just about numbers, it can plan one epic road trip too! 

Firstly, discord’s transition to Dagster and dbt for data orchestration 

This month, Discord announced a major overhaul of their data orchestration infrastructure, moving from their in-house system, Derived, to a combination of Dagster and dbt. As their platform and user base expanded, the need for enhanced self-service capabilities and robust observability became evident. This decision was driven by the necessity for declarative automation, a modern unified interface, reliability on Kubernetes, and seamless integration with existing tools. 

After evaluating open-source options like Argo and Prefect, Discord chose Dagster for orchestration and dbt for data modeling. This transition has enabled them to support over 2,000 dbt tables, enhancing their ability to deliver seamless service and insightful data analytics while scaling efficiently. 

Read about it here

Meta unveils Llama 3.1 

This month, Meta introduced Llama 3.1, a massive leap in open-source AI. The Llama 3.1 405B model brings unmatched flexibility and state-of-the-art capabilities, unlocking new workflows like synthetic data generation and model distillation. Additionally, Meta is enhancing the Llama ecosystem with new security tools and a reference system. Over 25 partners, including AWS and Google Cloud, will offer services from day one. 

 

Llama 3.1 models feature expanded context lengths to 128K, multilingual support, and strong performance across benchmarks. Upgraded 8B and 70B models enhance capabilities in general knowledge, tool use, and translation. 

Read Meta’s full update

Building a data-driven analytics team at DoorDash 

Jessica Lachs, DoorDash’s VP of Analytics & Data Science, shares insights on what it means to be truly data-driven and how to structure an analytics team. Having joined DoorDash as the first General Manager in 2014, Lachs has built the analytics team from the ground up and now leads global analytics, including the Wolt Analytics team post-acquisition. 

Not only does Lachs highlight that the term “analytics” can be ambiguous, encompassing data science, business intelligence, product analytics, machine learning, and BizOps. She also emphasizes that to build a data-driven organisation, founders should focus on desired outcomes rather than semantics. At DoorDash, the role of analytics has evolved with the company’s growth, shifting from gut instinct decisions to data-centric strategies. Initially, DoorDash used quasi-experimental methods due to limited data, but as the company matured, they invested in scalable data models and advanced experimentation capabilities, expanding their analytics scope to drive better decision-making. 

Read the full post here

Databricks’ migration to unity catalog for data governance 

In a recent blog post, the Data Platform team at Databricks shared insights into their migration to Unity Catalog for enhanced data governance. As the company grows, establishing secure, compliant, and cost-effective data operations has become a priority. With thousands of employees analysing data, consistent governance standards are essential, making the migration to Unity Catalog a top priority. 

The blog outlines the challenges and benefits of migrating from the default Hive Metastore (HMS) to Unity Catalog. While HMS lacked fine-grained access controls, lineage support, audit logs, and effective search integration, UC provided these features out-of-the-box. Therefore, the team chose a transformational approach, selectively migrating datasets to establish a structured governance framework. This strategy required more effort initially, but enabled clear data ownership, naming conventions, and intentional access, setting the stage for future governance policies.

Read the blog

Finally, some creative Data use!

Dr. Randy Olson, a full stack data scientist and AI researcher, utilised his expertise in machine learning to develop an optimal search strategy.  

He approached this task using the Traveling Salesman Problem (TSP) algorithm, which aims to find the shortest route that visits each city exactly once and returns to the starting point.  

Dr. Olson applied three specific restrictions:  

  1. The trip must stop in all 48 contiguous U.S. states 
  2. Only visit National Natural Landmarks, National Historic Sites, National Parks, or National Monuments #
  3. Be taken entirely by car without leaving the U.S. 

Want to take the trip? The route spans 13,699 miles and requires 224 hours (or 9.33 days) of driving, assuming no traffic. You can find the full itinerary here. 

Dr Randy Olsen used Data to design the optimum road trip across the U.S. Showing how useful data can be. Data science trends really are everywhere!

Olsen’s epic road trip

To conclude 

July highlighted several key trends in data and analytics. The push for scalability is evident in Discord’s adoption of Dagster and dbt, and Databricks’ migration to Unity Catalog for better data governance. The importance of building effective data teams was underscored by DoorDash’s approach to analytics leadership. Another notable trend is the growing emphasis on enhanced self-service capabilities and robust observability in data platforms. These themes point towards a future focused on scalable infrastructure, efficient governance, structured teams, and innovative AI applications.

If you’re interested in how we can help scale your data team, get in touch.

Data & AI Report – June 2024

July 2, 2024

1510

Welcome to our June Data & AI report!  

We’re covering some exciting news this month… who knew data catalogs could be so competitive? We’ve also got some interesting updates from NVIDIA and Netflix. 

Let’s get stuck in! 

Open Source battles: Databricks open sources Unity Catalog…

…live at the Data & AI summit

Following Snowflake’s announcement to open source their Polaris Catalog “within the next 90 days”, Matei Zaharia, Databricks’ CTO & Cofounder, went one up and opened the repo on his laptop during his Keynote speech at the Data & AI summit, navigated to the “danger zone” and in front of everyone, made the repo public there and then. Making Databricks the first to go open source in the industry. 

Gif of someone eating popcorn wearing 3d glasses. A joke referencing the drama Databricks opensourcing their data catalogs Data & AI Summit before Snowflake launched theirs.

Watch a video of the moment it went live here.

In Databricks’ announcement, they shared the reasoning behind their decision to make this public, explaining that “most data platforms today are walled gardens” going on to say “By open-sourcing Unity Catalog, we are giving organisations an open foundation for their current and future workloads.” 

NVIDIA Releases Open Synthetic Data Generation Pipeline for Training LLMs 

NVIDIA has launched an open synthetic data generation pipeline for training large language models. The Nemotron-4 340B family offers advanced instruct and reward models, along with a dataset for generative AI training.  

NVIDIA's Synthetic Response Data diagram

This system provides developers with a free and scalable solution to create synthetic data for building powerful LLMs, enhancing performance and accuracy. The models are designed to work seamlessly with NVIDIA NeMo and TensorRT-LLM for efficient model training and inference. 

Read NVIDIA’s full blog here.

Netflix share a recap of their Data Engineering Open Forum 

Netflix released a summary this month of the sessions from their Data Engineering Open Forum back in April. (along with recordings of all the talks!) 

One session introduced Netflix’s “Auto Remediation” feature, which uses machine learning to handle job errors more efficiently. Jide Ogunjobi talked about using generative AI to help organizations easily manage and query their large data systems. 

Tulika Bhatt explained how Netflix manages 18 billion daily impressions and the importance of real-time data for recommendations. We found Tulika’s talk particularly interesting as it highlighted the creative solutions Netflix employs to balance scalability and cost while delivering real-time data. 

Jessica Larson shared her experience building a new data platform after GDPR, focusing on data protection and compliance. Clark Wright from Airbnb discussed their new Data Quality Score to improve data quality. 

You can read about, and watch all of the talks here 

How Machine Learning is transforming Online Banking security 

Zachary Amos’ recent blog explores how behavioral biometrics can drastically reduce online banking fraud. This ML-driven technology works in the background, monitoring user behavior like mouse movements and keystrokes to spot anything unusual. It processes data in real-time, handling multiple users at once, making it a more streamlined and user-friendly security solution than traditional Multi-Factor Authentication. 

Zachary’s insights show the power of machine learning in boosting security. As cyber threats become more sophisticated, using technology like this ensures accounts stay secure and protected. 

To conclude 

June has been a month full of exciting open-source updates. Databricks made waves by open-sourcing Unity Catalog live on stage, while NVIDIA launched a synthetic data generation pipeline for training large language models.  

We’re especially interested in these open-source developments. They represent a move towards greater collaboration and accessibility in the tech world. 

Want to discuss how we can help you or your data team? Get in touch, or check out our open roles. 

Data & AI Report – May 2024

June 7, 2024

1510

Welcome to our May Data & AI report! This month, we explore developments in data, AI, and machine learning. From transforming data warehouses to pioneering AI/ML applications.

Discover how industry leaders are pushing the boundaries of what’s possible⤵️

Are data warehouses evolving beyond analytics?

Mikkel Dengsøe, founder of Synq released some interesting research about the evolving role of data warehouses this month. Once the domain of reporting and analytics, we’re increasingly seeing them underpin crucial functions like AI/ML, automated marketing, and regulatory reporting. This evolution ups the stakes significantly and means that data accuracy is a top concern for most companies.

Mikkel also points out how data teams and their stacks are growing rapidly. Companies today manage thousands of models and juggle numerous daily jobs to keep things running smoothly. With more business-critical data and a surge in data assets, effective testing approaches are more vital than ever. It seems that basic tests won’t cut it anymore and that niche solutions will be essential to maintain data reliability.

Chart from Mikkel's blog that shows companies using data warehouses more and more for business critical operations, like AI, ML, Business ops and Reporting.

Data warehouses now have to support business-critical uses like AI/ML, automated marketing, and regulatory reporting.

You can check out Mikkel’s full blog here.

What’s next for Uber’s ML platform, Michelangelo?

In a recent blog from Uber, the company shares the strides it’s made so far in machine learning (ML). Since 2016, Michelangelo, Uber’s centralised ML platform, is leveraging data to drive key functions like ETA predictions, rider-driver matching, rankings, and fraud detection. With around 400 active ML projects and over 5K models in production, Michelangelo manages 20K model training jobs monthly and delivers up to 10 million real-time predictions per second.

Screenshot from Uber's Blog, showing how real-time Machine Learning underpins the UberEats app's core user flow.

Real-time ML underpins Eater app core user flow.

The blog goes on to explain Uber’s plans to use generative AI and large language models (LLMs), with the Gen AI Gateway is at the forefront of its mission. With the aim to aid security, efficiency, and cost-effectiveness.

Read the full blog here.

LinkedIn launches LakeChime

This month, LinkedIn introduced LakeChime, a powerful data trigger service designed to enhance the efficiency of their extensive data lake. Handling billions of data points daily, LakeChime streamlines data processing by unifying data trigger semantics across both modern and traditional table formats like Hive and Iceberg.

Central to LakeChime is the Data Change Event (DCE), which captures updates within data tables and triggers downstream workflows via platforms like dbt or Airflow. This innovation ensures timely data availability and enhances pipeline efficiency.

Looking forward, LinkedIn plans to integrate LakeChime with dbt and Coral to automate incremental view maintenance, simplifying the creation of high-performance data pipelines.

Discover more about LakeChime in LinkedIn’s full blog post.

Comic Strip by Todd Comics, making a joke about data lakes that look well constructed and organised above the surface, but underneath the surface is an angry octopus with the Excel for a head, with the file name 'orders_final.xlsx'.

Spotlight on Slack’s female data engineers

Slack shared a blog last month highlighting the incredible work of their female data engineers across their various data teams.

By optimising data workflows with Apache Airflow and Apache Pinot, ensuring sub-second query latency. Senior Software Engineer, Jessica’s team is migrating from virtual machines to Kubernetes, using custom Python tools and automated deployments to boost efficiency.

Senior Software Engineer, Ramya talks about leading the migration from Spark 2 to Spark 3 on AWS EMR6, explaining how it enhances performance and reduces reliance on legacy systems.

Shrushti, another Senior Software Engineer transitioned Slack’s data ingestion from Secor to Bedrock and is now moving to Kafka Connect for real-time streaming. A shift that aligns with industry standards and improves system adaptability.

It’s a really interesting read and shines a light on Slack’s dedication to diversity and inclusion, as well as some of the incredible ways they’re using data. Read the full blog post to meet more inspiring engineers and discover the innovative projects shaping the future at Slack.

In conclusion…

As data continues to grow in volume and complexity, the strategies and technologies we employ must evolve. How will these innovations from Synq, Uber, LinkedIn, and Slack shape your business?

To stay ahead, organisations must keep pace with technological advancements with a culture of continuous learning and adaptation.

Want to discuss how we can help you or your data team? Get in touch, or check out our open roles.

Data & AI Report – April 2024

May 1, 2024

1510
Welcome to our first monthly update on data and AI. No need to scroll endlessly through news sites, we’ve compiled the month’s must-know developments right here!

April saw important developments in technology, highlighting investments and partnerships that emphasize the Netherlands’ involvement in the tech sector.

Google’s €640 Million Dutch Data Centre Project

Google announced a €640 million investment in a new data centre in Groningen, creating 125 jobs. This adds to Google’s total investment of over €3.8 billion in Dutch digital infrastructure since 2014. Read more

KLM Partners with Utrect University AI Labs

KLM Royal Dutch Airlines is collaborating with Utrecht University’s AI Labs to refine operational efficiency and minimize disruptions.

PhD students are developing algorithms to optimize crew and aircraft scheduling, and improve ground processes like baggage handling and passenger boarding. This partnership aims to enhance KLM’s ability to quickly adapt to changes, ensuring smoother operations and prioritising flights effectively through data. Read more

Google Launches Training Programs for AI, Cybersecurity, and Data Analytics

The U.S. Treasury and Google Cloud are partnering to boost data analytics and cybersecurity hiring, aligning with President Biden’s AI Executive Order.

New training programs, accessible via YouTube and Google Cloud Skills Boost, include courses on generative AI, cybersecurity, and data analytics, will equip individuals with the skills needed for digital transformation in the public sector.

Learners also get free access to generative AI tools, including Google’s interview prep tool, Interview Warmup. Read more

Gif showing Google Cloud's new Generative AI Interview Warmup tool.

Source: Google Cloud

AI Breakthrough in Breast Cancer Risk Assessment

Danish and Dutch researchers have advanced breast cancer risk assessment by combining an AI diagnostic tool with a mammographic texture model, under the leadership of Dr. Andreas D. Lauritzen.

This integrated approach improves the prediction of both short- and long-term breast cancer risks, identifying high-risk women more effectively. The innovation promises earlier cancer detection and could alleviate the strain on healthcare systems caused by a shortage of specialist breast radiologists. Read more

These developments underscore a growing need for expert Data, AI, and ML talent. Reach out to discuss how we can help to drive your innovation forward.

contact our team.