What is DSBulk?
DSBulk is an amazing tool we use here at DataStax to unload data from and load data into Cassandra. I recently had the opportunity to work with a very large customer to evaluate Astra. We used DSBulk to unload from on-prem DSE and load to Astra on GCP. You can find official information for Astra DSBulk here.
How Do You Use DSBulk With Astra?
The first thing we need to do is learn the dsbulk unload command:
dsbulk unload -url [outputfile].csv -k [keyspace] -t [table] -b "~/Downloads/secure-connect-bundle.zip" -u [clientID] -p [clientSecret]
This example unloads from Astra. To unload from another cluster, remove -b (bundle) and pass -h (host) and -port along with -u and -p. The official dsbulk manual is found here.
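To make that concrete, here is a minimal sketch of the same unload pointed at a self-managed cluster. The function name unload_from_cluster, the DSBULK override, the DSE_USER and DSE_PASSWORD variables, and the dump/ output path are all assumptions made for this example; only the flags themselves come from the text above.

```shell
# Path to the dsbulk launcher; override it (for example DSBULK=echo)
# to preview the command without touching a cluster.
DSBULK="${DSBULK:-dsbulk}"

# unload_from_cluster KEYSPACE TABLE HOST PORT
# Hypothetical helper: unloads one table from a non-Astra cluster, so it
# passes -h and -port instead of the -b secure connect bundle.
# DSE_USER and DSE_PASSWORD are assumed to be set by the caller.
unload_from_cluster() {
    "$DSBULK" unload \
        -url "dump/$2.csv" \
        -k "$1" -t "$2" \
        -h "$3" -port "$4" \
        -u "$DSE_USER" -p "$DSE_PASSWORD"
}
```

Calling `unload_from_cluster my_keyspace my_table 10.0.0.5 9042` would run the unload against that host (9042 is Cassandra's default native transport port). With DSBULK=echo the same call only prints the arguments, which is a cheap way to check the flags first.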
Next we need to learn the dsbulk load command:
dsbulk load -url [outputfile].csv -k [keyspace] -t [table] -b "~/Downloads/secure-connect-bundle.zip" -u [clientID] -p [clientSecret] -header true
Now that we have a working example of each, let's create a script that iterates through a list of tables and executes the command for each one.
#!/bin/ksh
# table.list holds whitespace-separated table names; each table is loaded from dump/<table>.csv
for table in $(cat table.list)
do
    echo "##### working on $table ######"
    sudo -E ./dsbulk-1.8.0/bin/dsbulk load -url "dump/${table}.csv" -k [keyspace] -t "${table}" -b ~/secure-connect-bundle.zip -u [clientID] -p [clientSecret] -header true
done
This is a sample load script. The unload version requires only slight modifications.
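For reference, the unload counterpart might look something like the sketch below. It is not the exact script used in the migration: the unload_tables function, and the DSBULK, BUNDLE, KEYSPACE, CLIENT_ID and CLIENT_SECRET variables, are placeholders introduced here. Also worth knowing: as far as the DSBulk documentation describes it, on unload the -url target is treated as a directory and the CSV files are written inside it, so dump/<table>.csv would end up being a directory.

```shell
# Placeholders for this sketch; adjust to your environment.
DSBULK="${DSBULK:-./dsbulk-1.8.0/bin/dsbulk}"
BUNDLE="${BUNDLE:-$HOME/secure-connect-bundle.zip}"

# unload_tables LIST_FILE
# Runs "dsbulk unload" once per table name found in LIST_FILE.
# KEYSPACE, CLIENT_ID and CLIENT_SECRET are assumed to be set by the caller.
unload_tables() {
    list_file=$1
    for table in $(cat "$list_file")
    do
        echo "##### unloading $table ######"
        "$DSBULK" unload -url "dump/${table}.csv" -k "$KEYSPACE" -t "$table" \
            -b "$BUNDLE" -u "$CLIENT_ID" -p "$CLIENT_SECRET" -header true
    done
}
```

Running `unload_tables table.list` walks the same table list as the load script. For an on-prem source you would swap the -b bundle for -h and -port as described earlier.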
What Does My Loading Environment Look Like?
To speed up loading, I uploaded a tar.gz of the unloaded CSV files to a GCP VM (4 vCPUs and 16 GB RAM) located closer to my us-east1 Astra cluster. With the VM set up, these are the terminal commands I ran to prepare the server:
sudo apt-get update
sudo apt install python python-pip
sudo apt install rsync
sudo apt install software-properties-common
sudo dpkg --configure -a
sudo apt-add-repository 'deb http://security.debian.org/debian-security stretch/updates main'
sudo apt-get update
sudo apt-get install openjdk-8-jdk
I had my source data and dsbulk in my local environment, so a quick rsync gets both to my VM:
rsync -e ssh -azvp dump.tar.gz stevenmatison@[GCPVMIP]:
rsync -e ssh -azvp dsbulk-1.8.0.tar.gz stevenmatison@[GCPVMIP]:
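The packing and unpacking steps around those transfers can be sketched roughly as follows. pack_dump and unpack_archive are hypothetical helper names, and the directory layout is an assumption; the only thing taken from the text above is that the CSV dump travels as a tar.gz.

```shell
# pack_dump DIR ARCHIVE
# Creates ARCHIVE (a .tar.gz) containing DIR, stored under its base name
# so that it extracts as "dump/..." rather than as a full local path.
pack_dump() {
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# unpack_archive ARCHIVE DEST
# Extracts ARCHIVE into DEST, creating DEST if it does not exist yet.
unpack_archive() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2"
}
```

On the local machine this would be `pack_dump dump dump.tar.gz` before the rsync, and on the VM `unpack_archive dump.tar.gz .` and `unpack_archive dsbulk-1.8.0.tar.gz .` afterwards.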
What Are The Lessons Learned?
- Some of our tables had no records and reported back an error:
At least 1 record does not match the provided schema.mapping or schema.query. Please check that the connector configuration and the schema configuration are correct.
- Some tables needed the following value added for large column sizes:
--connector.csv.maxCharsPerColumn 10000000
- Some tables needed a default value argument:
--connector.csv.nullValue null
- DSBulk is an amazing tool with a very powerful set of features!
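The first three lessons can be folded back into the load loop. The sketch below is one possible way to do that, not the script used in the migration: it skips any dump that has no rows beyond the header and passes both extra options on every load, even though only some tables needed them. file_has_rows, csv_has_rows, load_table and the DSBULK, DUMP_DIR, BUNDLE, KEYSPACE, CLIENT_ID and CLIENT_SECRET variables are names invented for this example, and the check assumes the CSV files were written with a header line.

```shell
# Placeholders for this sketch; adjust to your environment.
DSBULK="${DSBULK:-./dsbulk-1.8.0/bin/dsbulk}"
DUMP_DIR="${DUMP_DIR:-dump}"
BUNDLE="${BUNDLE:-$HOME/secure-connect-bundle.zip}"

# file_has_rows FILE: succeeds if FILE has at least one line after the header.
file_has_rows() {
    awk 'NR > 1 { found = 1; exit } END { exit !found }' "$1"
}

# csv_has_rows PATH: PATH may be a single CSV file or a directory of *.csv files.
csv_has_rows() {
    if [ -d "$1" ]; then
        for f in "$1"/*.csv; do
            [ -f "$f" ] && file_has_rows "$f" && return 0
        done
        return 1
    fi
    [ -f "$1" ] && file_has_rows "$1"
}

# load_table TABLE: loads DUMP_DIR/TABLE.csv, skipping empty dumps.
# KEYSPACE, CLIENT_ID and CLIENT_SECRET are assumed to be set by the caller.
load_table() {
    src="${DUMP_DIR}/$1.csv"
    if ! csv_has_rows "$src"; then
        echo "##### skipping $1 (no records) ######"
        return 0
    fi
    echo "##### working on $1 ######"
    "$DSBULK" load -url "$src" -k "$KEYSPACE" -t "$1" \
        -b "$BUNDLE" -u "$CLIENT_ID" -p "$CLIENT_SECRET" -header true \
        --connector.csv.maxCharsPerColumn 10000000 \
        --connector.csv.nullValue null
}
```

A driver loop then shrinks to `for table in $(cat table.list); do load_table "$table"; done`. If only a few tables need the extra options, it may be cleaner to keep them out of the default command and add them per table.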
What’s Next?
Now that I have all my DSE tables loaded to Astra, we can begin testing our application against Astra. If you are looking for more reference examples, check out msmygit’s repo for a much deeper dive into DSBulk.
How Can I Help You With Astra?
Find me over on the Astra Slack to ask me any questions about Astra. Also let’s chat if you have something kewl you did with Astra and you want me to feature it in my blog. Look below or to the right for more ways to find me.