This is a guest blog post co-authored with Atul Khare and Bhupender Panwar from Salesforce.
Headquartered in San Francisco, Salesforce, Inc. is a cloud-based customer relationship management (CRM) software company building artificial intelligence (AI)-powered business applications that allow businesses to connect with their customers in new and personalized ways.
The Salesforce Trust Intelligence Platform (TIP) log platform team is responsible for the data pipeline and data lake infrastructure, providing log ingestion, normalization, persistence, search, and detection capability to keep Salesforce safe from threat actors. It also runs miscellaneous services to facilitate investigation, mitigation, and containment for security operations. The TIP team is critical to securing Salesforce's infrastructure, detecting malicious threat activity, and providing timely responses to security events. This is achieved by collecting and inspecting petabytes of security logs across dozens of organizations, some with thousands of accounts.
In this post, we discuss how the Salesforce TIP team optimized their architecture using Amazon Web Services (AWS) managed services to achieve better scalability, cost, and operational efficiency.
TIP existing architecture bird's-eye view and scale of the platform
The main key performance indicator (KPI) for the TIP platform is its ability to ingest a high volume of security logs from a variety of Salesforce internal systems in real time and process them with high velocity. The platform ingests more than 1 PB of data per day, more than 10 million events per second, and more than 200 different log types. It accepts log data in JSON, text, and Common Event Format (CEF) formats.
The message bus in TIP's existing architecture primarily uses Apache Kafka to ingest the different log types coming from the upstream systems. Kafka had a single topic for all the log types before they were consumed by different downstream applications, including Splunk, Streaming Search, and Log Normalizer. The normalized Parquet logs are stored in an Amazon Simple Storage Service (Amazon S3) data lake and cataloged into Hive Metastore (HMS) on an Amazon Relational Database Service (Amazon RDS) instance based on S3 event notifications. The data lake consumers then use Apache Presto running on an Amazon EMR cluster to perform ad hoc queries. Other teams, including the Data Science and Machine Learning teams, use the platform to detect, analyze, and control security threats.
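Because Kafka used a single topic for all log types, each downstream application had to read and filter the full event stream, and its parallelism was bounded by the topic's partition count. The following minimal sketch (using kafka-python, with a hypothetical topic name and payload fields rather than the actual Salesforce setup) illustrates that consumption pattern:

```python
# Hedged sketch of a downstream consumer on the single shared Kafka topic.
# Topic, broker address, group ID, and payload fields are assumptions.
import json

from kafka import KafkaConsumer  # kafka-python


def process(event: dict) -> None:
    """Stand-in for downstream-specific handling (normalization, indexing, and so on)."""
    print(event)


consumer = KafkaConsumer(
    "tip-security-logs",                        # hypothetical single topic for all log types
    bootstrap_servers=["kafka.internal:9092"],  # hypothetical brokers
    group_id="log-normalizer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Every consumer sees every log type and filters for the ones it handles;
    # its parallelism is capped by the number of partitions on the topic.
    if event.get("log_type") == "cef_firewall":
        process(event)
```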
Challenges with the existing TIP log platform architecture
Some of the main challenges that TIP's existing architecture was facing include:
- Heavy operational overhead and maintenance cost of managing the Kafka cluster
- High cost to serve (CTS) to meet growing business needs
- Compute threads limited by the number of partitions
- Difficult to scale out when traffic increases
- Weekly patching creates lags
- Challenges with HMS scalability
All these challenges motivated the TIP team to embark on a journey to create a more optimized platform that is easier to scale, with less operational overhead and lower CTS.
New TIP log platform architecture
The Salesforce TIP log platform engineering team, in collaboration with AWS, started building a new architecture to replace the Kafka-based message bus solution with the fully managed AWS messaging and notification services Amazon Simple Queue Service (Amazon SQS) and Amazon Simple Notification Service (Amazon SNS). In the new design, the upstream systems send their logs to a central Amazon S3 storage location, which invokes a process to partition the logs and store them in an S3 data lake. Consumer applications such as Splunk get the messages delivered to their systems using Amazon SQS. Similarly, the partitioned log data, through Amazon SQS events, initiates a log normalization process that delivers the normalized log data to open source Delta Lake tables on an S3 data lake. One of the major changes in the new architecture is the use of an AWS Glue Data Catalog to replace the earlier Hive Metastore. The ad hoc analysis applications use Apache Trino on an Amazon EMR cluster to query the Delta tables cataloged in AWS Glue. Other consumer applications also read the data from S3 data lake files stored in Delta table format. More details on some of the important processes follow.
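As an illustration of the new fan-out (a sketch under assumed names, not the actual Salesforce implementation), the following boto3 snippet wires a central log bucket to an SNS topic with one SQS queue subscription per consumer. The bucket, topic, and queue names are hypothetical, and the SNS topic and SQS queue access policies required for S3-to-SNS and SNS-to-SQS delivery are omitted for brevity.

```python
# Sketch of the S3 -> SNS -> SQS fan-out backbone; all resource names are illustrative.
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")
sqs = boto3.client("sqs")

# SNS topic that receives object-created events from the central log bucket.
topic_arn = sns.create_topic(Name="tip-raw-log-events")["TopicArn"]

# One SQS queue per downstream consumer (for example, log partitioner and Splunk ingestor).
for queue_name in ["tip-log-partitioner", "tip-splunk-ingestor"]:
    queue_url = sqs.create_queue(QueueName=queue_name)["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]
    sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        Attributes={"RawMessageDelivery": "true"},
    )

# Send S3 object-created notifications from the central log bucket to the topic.
s3.put_bucket_notification_configuration(
    Bucket="tip-central-log-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {"TopicArn": topic_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)
```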
Log partitioner (Spark structured stream)
This service consumes logs from the S3, Amazon SNS, and Amazon SQS-based store and writes them in partitioned (by log type) format to S3 for further downstream consumption through Amazon SNS and Amazon SQS subscriptions. This is the bronze layer of the TIP data lake.
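A minimal PySpark Structured Streaming sketch of this partitioning step is shown below. It assumes a simplified schema and hypothetical bucket paths, and it uses Spark's plain S3 file source for readability, whereas the production job is driven by the SNS and SQS notifications described above.

```python
# Hedged sketch of the log partitioner (bronze layer); paths and schema are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tip-log-partitioner").getOrCreate()

raw_logs = (
    spark.readStream
    .format("json")
    .schema("log_type STRING, event_time STRING, payload STRING")  # simplified schema
    .load("s3://tip-central-log-bucket/raw/")                      # hypothetical raw drop location
)

query = (
    raw_logs.writeStream
    .format("parquet")
    .partitionBy("log_type")                                       # partition output by log type
    .option("checkpointLocation", "s3://tip-checkpoints/partitioner/")
    .start("s3://tip-data-lake/bronze/")                           # bronze layer output
)
query.awaitTermination()
```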
Log normalizer (Spark structured stream)
One of the downstream consumers of the log partitioner (the Splunk ingestor is another), the log normalizer consumes the data from the partitioned output on S3, using Amazon SNS and Amazon SQS notifications, and enriches it using Salesforce custom parsers and tags. Finally, this enriched data is landed in the data lake on S3. This is the silver layer of the TIP data lake.
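The sketch below shows the shape of such a normalization job writing Delta Lake output. The enrichment is reduced to a placeholder column because the actual Salesforce parsers and tags are not public, and the job assumes the delta-spark package is available on the cluster.

```python
# Hedged sketch of the log normalizer (silver layer); the enrichment step is a placeholder.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("tip-log-normalizer")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

bronze = (
    spark.readStream
    .format("parquet")
    .schema("log_type STRING, event_time STRING, payload STRING")  # simplified schema
    .load("s3://tip-data-lake/bronze/")
)

# Placeholder for Salesforce's custom parsers and tags.
silver = bronze.withColumn("normalized_at", F.current_timestamp())

query = (
    silver.writeStream
    .format("delta")
    .partitionBy("log_type")
    .option("checkpointLocation", "s3://tip-checkpoints/normalizer/")
    .start("s3://tip-data-lake/silver/")                           # silver layer output
)
query.awaitTermination()
```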
Machine learning and other data analytics consumers (Trino, Flink, and Spark jobs)
These consumers read from the silver layer of the TIP data lake and run analytics for security detection use cases. The earlier Kafka interface is now converted to Delta streams ingestion, which completes the removal of the Kafka bus from the TIP data pipeline.
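A hedged sketch of that Delta streams ingestion pattern in PySpark is shown below; the table path, filter, and console sink are illustrative stand-ins for the real detection jobs.

```python
# Sketch of a detection consumer reading the silver Delta table as a stream.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tip-detection-consumer").getOrCreate()

auth_events = (
    spark.readStream
    .format("delta")                               # requires delta-spark on the cluster
    .load("s3://tip-data-lake/silver/")            # hypothetical silver table path
    .where("log_type = 'auth'")                    # hypothetical detection scope
)

query = (
    auth_events.writeStream
    .format("console")                             # stand-in for the real detection sink
    .option("checkpointLocation", "s3://tip-checkpoints/detection-demo/")
    .start()
)
query.awaitTermination()
```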
Advantages of the new TIP log platform architecture
The main advantages realized by the Salesforce TIP team with this new architecture using Amazon S3, Amazon SNS, and Amazon SQS include:
- Cost savings of approximately $400 thousand per month
- Auto scaling to meet growing business needs
- Zero DevOps maintenance overhead
- No mapping of partitions to compute threads
- Compute resources can be scaled up and down independently
- Fully managed Data Catalog to reduce the operational overhead of managing HMS
Summary
In this blog post, we discussed how the Salesforce Trust Intelligence Platform (TIP) optimized its data pipeline by replacing the Kafka-based message bus with the fully managed AWS messaging and notification services Amazon SQS and Amazon SNS. The Salesforce and AWS teams worked together to make sure this new platform seamlessly scales to ingest more than 1 PB of data per day, more than 10 million events per second, and more than 200 different log types. Reach out to your AWS account team if you have similar use cases and need help architecting your platform to achieve operational efficiency and scale.
About the authors
Atul Khare is a Director of Engineering at Salesforce Security, where he spearheads the Security Log Platform and Data Lakehouse initiatives. He supports diverse security customers by building robust big data ETL pipelines that are elastic, resilient, and easy to use, providing uniform and consistent security datasets for threat detection and response operations, AI, forensic analysis, analytics, and compliance needs across all Salesforce clouds. Beyond his professional endeavors, Atul enjoys performing music with his band to raise funds for local charities.
Bhupender Panwar is a Big Data Architect at Salesforce and a seasoned advocate for big data and cloud computing. His background encompasses the development of data-intensive applications and pipelines, solving intricate architectural and scalability challenges, and extracting valuable insights from extensive datasets within the technology industry. Outside of his big data work, Bhupender likes to hike, bike, and travel, and is a great foodie.
Avijit Goswami is a Principal Solutions Architect at AWS specializing in data and analytics. He helps AWS strategic customers build high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open source solutions. Outside of work, Avijit likes to travel, hike the San Francisco Bay Area trails, watch sports, and listen to music.
Vikas Panghal is the Principal Product Manager leading the product management team for Amazon SNS and Amazon SQS. He has deep expertise in event-driven and messaging applications and brings a wealth of knowledge and experience to his role, shaping the future of messaging services. He is passionate about helping customers build highly scalable, fault-tolerant, and loosely coupled systems. Outside of work, he enjoys spending time with his family outdoors, playing chess, and running.