In the modern business landscape, data lies at the heart of every organization's operations. From decision-making and predictive analytics to reporting and analysis, the applications of data are boundless. Data, in its final form, is presented to stakeholders to offer insights into an organization's internal and external workings.
Consider the realm of Artificial Intelligence (AI), where the accuracy of the assumptions made by AI models depends on the quality and integrity of the underlying data. AI algorithms leverage datasets provided by Data Engineers to train models at scale, and these models, in turn, generate business recommendations and predictions for stakeholders.
Ensuring data integrity is of paramount importance. The data must be reliable enough for AI models to derive meaningful insights.
The use of AI in conjunction with Kubernetes is gaining prominence. Workloads are increasingly migrating to Kubernetes, which treats AI models like microservices passing through a CI/CD pipeline. This approach leverages MLOps practices and tools such as Kubeflow. Source code is stored in Git repositories, packaged into container images, and deployed into Kubernetes as part of the data pipeline. This gives organizations significant control, allowing them to promote or roll back specific versions of AI models when issues arise, such as inaccurate predictions.
Additionally, this approach offers the ability to control dataset versions as if they were Git repositories. For example, when new data sources need to be incorporated into the data lake, organizations can branch the existing dataset, conduct unit and integration tests to ensure the new data source is safe, and then merge these changes back into the original dataset. This results in a well-structured and tested dataset that can be used by data scientists and Business Intelligence (BI) engineers.
Today, we'll explore how z1storage.com, in partnership with LakeFS, enables organizations to bring this version control concept to life, treating data lakes as Git repositories. We'll use Ceph S3 as the data lake, connected directly to LakeFS as its S3 block store.
Prerequisites
Before we begin, ensure you have the following prerequisites in place:
- A Ceph cluster with an exposed RadosGW S3 interface.
- Docker Compose installed on your computer.
Installation
Create a RadosGW User:
To interact with Ceph's S3 interface, create a user named 'lakefs' as follows:
bash
radosgw-admin user create --uid=lakefs --display-name="Lakefs User" --access-key=lakefs --secret-key=lakefs
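You can verify that the user and its keys were created correctly by asking radosgw-admin to print the user's details:
bash
radosgw-admin user info --uid=lakefs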
Configure Ceph's S3 in LakeFS:
Save Ceph's S3 configuration in LakeFS's environment file so that LakeFS uses Ceph as its S3 block store:
bash
LAKEFS_CONFIG_FILE=./.lakefs-env
echo "AWS_ACCESS_KEY_ID=lakefs" > $LAKEFS_CONFIG_FILE
echo "AWS_SECRET_ACCESS_KEY=lakefs" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_S3_ENDPOINT=http://192.168.1.53:8080" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_TYPE=s3" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true" >> $LAKEFS_CONFIG_FILE
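Note that the variable assignment sits on its own line so it persists in the shell; we reuse $LAKEFS_CONFIG_FILE when deploying below. Before deploying, you can double-check that the environment file contains the expected settings:
bash
cat ./.lakefs-env
# Expected contents:
# AWS_ACCESS_KEY_ID=lakefs
# AWS_SECRET_ACCESS_KEY=lakefs
# LAKEFS_BLOCKSTORE_S3_ENDPOINT=http://192.168.1.53:8080
# LAKEFS_BLOCKSTORE_TYPE=s3
# LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true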
Deploy LakeFS:
Deploy LakeFS, along with Postgres for metadata persistence, using docker-compose:
bash
curl https://compose.lakefs.io | docker-compose --env-file $LAKEFS_CONFIG_FILE -f - up
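Once the containers are up, you can check that LakeFS is responding. The sketch below assumes LakeFS's sample Compose file exposes the server on its default port, 8000; adjust the port if your deployment differs. The LakeFS UI, where the initial admin user is created, should also be reachable at http://localhost:8000 in your browser.
bash
# A 204 response from the health endpoint means LakeFS is up
# (port 8000 and the /api/v1/healthcheck path are LakeFS defaults)
curl -i http://localhost:8000/api/v1/healthcheck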
Create an S3 Bucket:
Create an S3 bucket to be used by LakeFS to store versioned objects:
bash
aws s3 mb s3://lakefs/ --endpoint-url http://192.168.1.53:8080
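To confirm the bucket was created, list the buckets on the RadosGW endpoint:
bash
aws s3 ls --endpoint-url http://192.168.1.53:8080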
Creating a Versioned Data Repository
Now that the infrastructure requirements are set up, let's create a versioned data repository:
bash
lakectl repo create lakefs://repo s3://lakefs -d main
This repository is configured to point to the S3 bucket created earlier. Verify the repository by listing it:
bash
lakectl repo list
Uploading and Versioning Data
Let's create a simple text file and upload it to the main branch of our repository:
bash
echo "My name is Shon" >> txtfile.txt lakectl fs upload -s txtfile.txt lakefs://repo/main/txtfile.txt
Now that we have a version of our data in LakeFS, we can commit it:
bash
lakectl commit -m "added one version of the txt file" lakefs://repo/main
To create a second version of the text file, add a new line to it:
bash
echo "And my last name is Paz" >> txtfile.txt
Create a new branch for this second version:
bash
lakectl branch create lakefs://repo/txtfile_change -s lakefs://repo/main
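Listing the repository's branches should now show txtfile_change alongside main:
bash
lakectl branch list lakefs://repo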
Upload the updated file to the new branch:
bash
lakectl fs upload -s txtfile.txt lakefs://repo/txtfile_change/txtfile.txt
Now, commit this change to the second branch:
bash
lakectl commit -m "added second version of the txt file" lakefs://repo/txtfile_change
Merging Versions
Like a standard Git repository, LakeFS allows you to merge changes. Merge the updated file from the second branch into the main branch:
bash
lakectl merge lakefs://repo/txtfile_change lakefs://repo/main
This effectively merges the changes and keeps the data versioned.
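To confirm the merge, read the file back from the main branch; it should now contain both lines we appended:
bash
lakectl fs cat lakefs://repo/main/txtfile.txt
# My name is Shon
# And my last name is Paz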
Conclusion
This demonstration illustrates how z1storage.com and LakeFS can empower organizations by treating a Ceph S3-backed data lake like a Git repository. This capability enhances data management and version control, ensuring the integrity of your data throughout its lifecycle. For more information and to explore the benefits for your organization, visit z1storage.com. Thank you for joining us, and see you next time!