Data & AI Report Archives

Welcome to our June Data & AI report!

We’re covering some exciting news this month… who knew data catalogs could be so competitive? We’ve also got some interesting updates from NVIDIA and Netflix.

Let’s get stuck in!

Open Source battles: Databricks open sources Unity Catalog…

…live at the Data + AI Summit

Following Snowflake’s announcement that it would open source its Polaris Catalog “within the next 90 days”, Matei Zaharia, Databricks’ CTO & Cofounder, went one better. During his keynote at the Data + AI Summit, he opened the repo on his laptop, navigated to the “danger zone” and, in front of everyone, made the repo public there and then, putting Databricks’ open-source catalog out ahead of Snowflake’s.

GIF of someone eating popcorn wearing 3D glasses, a joke about the drama of Databricks open-sourcing its data catalog at the Data + AI Summit before Snowflake launched theirs.

Watch a video of the moment it went live here.

In Databricks’ announcement, they shared the reasoning behind their decision to make this public, explaining that “most data platforms today are walled gardens” and going on to say, “By open-sourcing Unity Catalog, we are giving organisations an open foundation for their current and future workloads.”

NVIDIA Releases Open Synthetic Data Generation Pipeline for Training LLMs

NVIDIA has launched an open synthetic data generation pipeline for training large language models. The Nemotron-4 340B family offers advanced instruct and reward models, along with a dataset for generative AI training.

NVIDIA's Synthetic Response Data diagram

This system provides developers with a free and scalable solution to create synthetic data for building powerful LLMs, enhancing performance and accuracy. The models are designed to work seamlessly with NVIDIA NeMo and TensorRT-LLM for efficient model training and inference.

Read NVIDIA’s full blog here.

Netflix share a recap of their Data Engineering Open Forum

Netflix released a summary this month of the sessions from their Data Engineering Open Forum back in April, along with recordings of all the talks!

One session introduced Netflix’s “Auto Remediation” feature, which uses machine learning to handle job errors more efficiently. Jide Ogunjobi talked about using generative AI to help organizations easily manage and query their large data systems.

Tulika Bhatt explained how Netflix manages 18 billion daily impressions and the importance of real-time data for recommendations. We found Tulika’s talk particularly interesting as it highlighted the creative solutions Netflix employs to balance scalability and cost while delivering real-time data.

Jessica Larson shared her experience building a new data platform after GDPR, focusing on data protection and compliance. Clark Wright from Airbnb discussed their new Data Quality Score to improve data quality.

You can read about, and watch, all of the talks here.

How Machine Learning is transforming Online Banking security

Zachary Amos’ recent blog explores how behavioral biometrics can drastically reduce online banking fraud. This ML-driven technology works in the background, monitoring user behavior like mouse movements and keystrokes to spot anything unusual. It processes data in real time and handles multiple users at once, making it a more streamlined and user-friendly security solution than traditional Multi-Factor Authentication.

Zachary’s insights show the power of machine learning in boosting security. As cyber threats become more sophisticated, technology like this can help keep accounts secure and protected.
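To make the idea of spotting unusual behaviour more concrete, here is a minimal sketch in Python. It is not taken from Zachary’s blog: the function name `is_anomalous`, the sample timings, and the simple z-score rule are all assumptions chosen for illustration, and a production system would use far richer signals and models.

```python
from statistics import mean, stdev


def is_anomalous(baseline_ms, session_ms, threshold=3.0):
    """Flag a session whose average keystroke interval is more than
    `threshold` standard deviations from the user's baseline average.

    Hypothetical helper for illustration only.
    """
    mu = mean(baseline_ms)
    sigma = stdev(baseline_ms)
    if sigma == 0:
        # No variation in the baseline: any difference counts as unusual.
        return mean(session_ms) != mu
    z_score = abs(mean(session_ms) - mu) / sigma
    return z_score > threshold


# Made-up intervals (milliseconds between keystrokes) for one user.
baseline = [110, 120, 115, 125, 118, 112, 121, 117]
typical_session = [116, 119, 122, 114]
suspicious_session = [40, 45, 38, 42]  # much faster, e.g. scripted input

typical_flag = is_anomalous(baseline, typical_session)
suspicious_flag = is_anomalous(baseline, suspicious_session)
```

The check runs silently on data the user already produces, which is why this style of monitoring adds less friction than asking for a second factor.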

To conclude

June has been a month full of exciting open-source updates. Databricks made waves by open-sourcing Unity Catalog live on stage, while NVIDIA launched a synthetic data generation pipeline for training large language models.

We’re especially interested in these open-source developments. They represent a move towards greater collaboration and accessibility in the tech world.

Want to discuss how we can help you or your data team? Get in touch, or check out our open roles.

Welcome to our May Data & AI report! This month, we explore developments in data, AI, and machine learning, from transforming data warehouses to pioneering AI/ML applications.

Discover how industry leaders are pushing the boundaries of what’s possible⤵️

Are data warehouses evolving beyond analytics?

Mikkel Dengsøe, founder of Synq, released some interesting research about the evolving role of data warehouses this month. Once the domain of reporting and analytics, we’re increasingly seeing them underpin crucial functions like AI/ML, automated marketing, and regulatory reporting. This evolution ups the stakes significantly and means that data accuracy is a top concern for most companies.

Mikkel also points out how data teams and their stacks are growing rapidly. Companies today manage thousands of models and juggle numerous daily jobs to keep things running smoothly. With more business-critical data and a surge in data assets, effective testing approaches are more vital than ever. It seems that basic tests won’t cut it anymore and that niche solutions will be essential to maintain data reliability.

Chart from Mikkel's blog that shows companies using data warehouses more and more for business critical operations, like AI, ML, Business ops and Reporting.

Data warehouses now have to support business-critical uses like AI/ML, automated marketing, and regulatory reporting.

You can check out Mikkel’s full blog here.

What’s next for Uber’s ML platform, Michelangelo?

In a recent blog from Uber, the company shares the strides it’s made so far in machine learning (ML). Since 2016, Michelangelo, Uber’s centralised ML platform, has been leveraging data to drive key functions like ETA predictions, rider-driver matching, rankings, and fraud detection. With around 400 active ML projects and over 5K models in production, Michelangelo manages 20K model training jobs monthly and delivers up to 10 million real-time predictions per second.

Screenshot from Uber's Blog, showing how real-time Machine Learning underpins the UberEats app's core user flow.

Real-time ML underpins Eater app core user flow.

The blog goes on to explain Uber’s plans to use generative AI and large language models (LLMs), with the Gen AI Gateway at the forefront of its mission and the aim of improving security, efficiency, and cost-effectiveness.

Read the full blog here.

LinkedIn launches LakeChime

This month, LinkedIn introduced LakeChime, a powerful data trigger service designed to enhance the efficiency of their extensive data lake. Handling billions of data points daily, LakeChime streamlines data processing by unifying data trigger semantics across both traditional and modern table formats like Hive and Iceberg.

Central to LakeChime is the Data Change Event (DCE), which captures updates within data tables and triggers downstream workflows via platforms like dbt or Airflow. This innovation ensures timely data availability and enhances pipeline efficiency.
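The trigger pattern described above can be sketched in a few lines of Python. This is only an illustration of the general idea of change events fanning out to downstream workflows: the `DataChangeEvent` fields, the `TriggerRegistry` class, and the `refresh_daily_report` callback are hypothetical names, not LakeChime’s actual schema or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DataChangeEvent:
    """Hypothetical record of one change to a table (not LakeChime's schema)."""
    table: str
    partition: str
    change_type: str  # e.g. "append" or "overwrite"


class TriggerRegistry:
    """Maps table names to the downstream callbacks that depend on them."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[DataChangeEvent], str]]] = {}

    def subscribe(self, table: str, callback: Callable[[DataChangeEvent], str]) -> None:
        self._subscribers.setdefault(table, []).append(callback)

    def publish(self, event: DataChangeEvent) -> List[str]:
        # Run every workflow registered for the changed table, in order.
        return [callback(event) for callback in self._subscribers.get(event.table, [])]


def refresh_daily_report(event: DataChangeEvent) -> str:
    # Stand-in for kicking off a real job in a scheduler or transformation tool.
    return f"refresh daily_report for {event.table}/{event.partition}"


registry = TriggerRegistry()
registry.subscribe("orders", refresh_daily_report)
triggered = registry.publish(DataChangeEvent("orders", "2024-05-01", "append"))
```

Because consumers only see the event, not the underlying table format, the same subscription would work whether the change came from a Hive or an Iceberg table, which is the kind of unification the paragraph above describes.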

Looking forward, LinkedIn plans to integrate LakeChime with dbt and Coral to automate incremental view maintenance, simplifying the creation of high-performance data pipelines.

Discover more about LakeChime in LinkedIn’s full blog post.

Comic strip by Todd Comics, making a joke about data lakes that look well constructed and organised above the surface, while underneath lurks an angry octopus with an Excel file for a head, with the file name 'orders_final.xlsx'.

Spotlight on Slack’s female data engineers

Slack shared a blog last month highlighting the incredible work of their female data engineers across their various data teams.

The blog covers work on optimising data workflows with Apache Airflow and Apache Pinot to ensure sub-second query latency. Senior Software Engineer Jessica’s team is migrating from virtual machines to Kubernetes, using custom Python tools and automated deployments to boost efficiency.

Senior Software Engineer Ramya talks about leading the migration from Spark 2 to Spark 3 on AWS EMR 6, explaining how it enhances performance and reduces reliance on legacy systems.

Shrushti, another Senior Software Engineer, transitioned Slack’s data ingestion from Secor to Bedrock and is now moving to Kafka Connect for real-time streaming, a shift that aligns with industry standards and improves system adaptability.

It’s a really interesting read and shines a light on Slack’s dedication to diversity and inclusion, as well as some of the incredible ways they’re using data. Read the full blog post to meet more inspiring engineers and discover the innovative projects shaping the future at Slack.

In conclusion…

As data continues to grow in volume and complexity, the strategies and technologies we employ must evolve. How will these innovations from Synq, Uber, LinkedIn, and Slack shape your business?

To stay ahead, organisations must keep pace with technological advancements through a culture of continuous learning and adaptation.

Data & AI Report – June 2024

Data & AI Report – May 2024

Data & AI Report – April 2024

Welcome to our first monthly update on data and AI. No need to scroll endlessly through news sites: we’ve compiled the month’s must-know developments right here!

Google’s €640 Million Dutch Data Centre Project

KLM Partners with Utrecht University AI Labs

Google Launches Training Programs for AI, Cybersecurity, and Data Analytics

AI Breakthrough in Breast Cancer Risk Assessment
