Interacting with TigerGraph’s AMLSim Fraud Detection Graph with pySpark
Note: You can find a similar blog post on integrating Spark and TigerGraph here. This blog will walk through similar content but will use pySpark instead of Spark. The setup (most of Part I and all of Part II) will be the same, so you can skip or briefly skim those sections if you have read the other blog.
Apache Spark is a data processing software known for its ability to process large datasets and distribute data processing tasks among multiple computers. PySpark is the Python API for Spark and allows Python users to manipulate the data using Spark all in Python. TigerGraph is an enterprise-scale graph database. Together, Spark and PySpark’s abilities with TigerGraph’s graph database can be used to manipulate and analyze big data to make big discoveries.
This tutorial will walk through the basics of getting started with both pySpark and TigerGraph and using TigerGraph’s JDBC driver to connect the two.
- Scala (v2.12.15) → Spark is written in Scala, so Scala will need to be downloaded.
- Java (v16.0.1) → This should already be installed on the computer. Java will be used to process the driver.
- TigerGraph On-Premise (v3.6.0) → TigerGraph on-premise will be used to run and host the database.
- TigerGraph JDBC Driver (v1.3.0) → This will connect Spark with TigerGraph.
- Apache Spark (v3.2.1) → The primary data processing software used.
- Python3 (v3.8.9) → The primary language used for this project is Python.
- Pip (v22.1.2) → Pip is a Python package installer and an alternate method to install pySpark.
- pySpark (v3.2.1) → This is the Python API for Spark that will be used to interact with TigerGraph.
- Docker Desktop → Docker will be used to run TigerGraph on-premise.
The full code for this project can be found here.
- Create the AML Sim Graph
- Create the Project
- Read TigerGraph Data with PySpark
- Resources and Next Steps
To start off, we will download Spark and Scala using brew, a package manager available on Linux and macOS. Spark is the data manipulation software we will use, and Spark itself is built on Scala.
brew install scala && brew install apache-spark
To verify Spark and Scala were downloaded properly, open the Spark shell with:
spark-shell
If the shell runs without any errors, Spark is successfully downloaded!
PySpark should be installed automatically when installing Spark. To verify it is installed, run
pyspark to open the Python Spark shell.
Note: Notice that, unlike the previous Scala shell, the prompt in this Python shell is preceded with >>>.
If it is not downloaded, pySpark can also be downloaded via pip. To do so, create a virtual environment and pip install pySpark.
python3 -m venv venv
source venv/bin/activate
pip install pyspark
Download the latest JDBC driver from Maven.
At the time of publication, the most recent driver is 1.3.0, so that is the version this blog will use. Hold on to the
tigergraph-jdbc-driver-1.3.0.jar file you downloaded; we will use it when creating the project.
TigerGraph can be downloaded via Docker. The full instructions to download it can be found here.
In summary, start Docker on your computer (by downloading the Desktop app and running the application). Next, create a folder on your machine to hold the Docker data.
mkdir data
chmod 777 data
Finally, run the TigerGraph Docker image.
docker run -d -p 14022:22 -p 9000:9000 -p 14240:14240 --name tigergraph --ulimit nofile=1000000:1000000 -v ~/data:/home/tigergraph/mydata -t docker.tigergraph.com/tigergraph:latest
To enter the TigerGraph shell, you can ssh into the container.
ssh -p 14022 tigergraph@localhost
At the password prompt, enter tigergraph.
Perfect! Your TigerGraph box is ready to load a graph!
To begin, run gadmin start all in the TigerGraph shell. This will start all TigerGraph services.
gadmin start all
The graph this tutorial will load is the AML Sim Graph. To start off, clone the repository.
git clone https://github.com/TigerGraph-DevLabs/AMLSim_Python_Lab.git
cd AMLSim_Python_Lab
This repository contains the GSQL scripts needed to create the graph.
To build the schema, run the schema GSQL script in the repository. This will create the AMLSim graph.
Next, to load the data, run the three loading files.
cd db_scripts/load
gsql load_job_accounts.gsql
gsql load_job_transactions.gsql
cd ../..
This will load the CSV data in the data folder into the graph.
Finally, install queries to the graph database. This tutorial will only walk through an example of running one query, but you may install as many as you would like.
Exploring in GraphStudio
To view the AMLSim graph visually, you can open http://localhost:14240/ to view GraphStudio. There, you can access the schema, view the stats of the loaded data, and run queries.
Unlike Scala and Java Spark projects, pySpark projects do not need a specific format type. All of the files for this project will be located in the same folder.
Note: The version and naming syntax of the .jar driver can change; just ensure it is the file downloaded from Maven.
Start index.py by importing SparkSession from pySpark.
from pyspark.sql import SparkSession
Next, set the spark variable:
spark = SparkSession.builder \
    .appName("TigerGraphAnalysis") \
    .config("spark.driver.extraClassPath", "/usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*:tigergraph-jdbc-driver-1.3.0.jar") \
    .getOrCreate()
appName is simply the name of the project, in my case "TigerGraphAnalysis". The
.config option sets the driver's classpath. Its value contains the directory of the Spark jars and the path to the driver. If you installed Spark via Homebrew, the jars will be located as above, at
/usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*. Next, if the driver is in the same folder as the file, you can simply enter its filename,
tigergraph-jdbc-driver-1.3.0.jar. Finally, combine the two paths with a colon and pass the result to .config.
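Putting that together, the classpath value is just the two paths joined by a colon. A quick sketch (both paths are assumptions based on the Homebrew install and driver location described above):

```python
# Build the value for spark.driver.extraClassPath by joining the Spark
# jars directory and the JDBC driver path with a colon. Both paths are
# assumptions: adjust them to match your own install.
spark_jars = "/usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*"
driver_jar = "tigergraph-jdbc-driver-1.3.0.jar"

extra_classpath = ":".join([spark_jars, driver_jar])
print(extra_classpath)
# → /usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*:tigergraph-jdbc-driver-1.3.0.jar
```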
Run the Project
To run the project,
spark-submit must be used to run the file, and the driver must be specified with the --jars flag.
spark-submit --jars tigergraph-jdbc-driver-1.3.0.jar index.py
With that, we are set to begin interacting with the AMLSim graph!
In general, pySpark’s syntax is similar to Spark’s syntax.
jdbcDF1 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "vertex Transaction") \
    .option("limit", "1000") \
    .option("debug", "0") \
    .load()

jdbcDF1.show()
To start, reading vertices requires options specifying the use of the JDBC driver along with options specifying graph information and credentials. To select vertices, the
dbtable must be
vertex followed by the vertex type, in this case Transaction. The
limit option specifies the maximum number of vertices to return.
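Since most of these options repeat across every read, it can help to collect them in one place. Below is a minimal sketch using a hypothetical helper (not part of the JDBC driver); the keys and defaults mirror the .option() calls shown above:

```python
def tg_read_options(dbtable, limit=1000, **extra):
    """Assemble the common TigerGraph JDBC read options.

    Hypothetical convenience helper: the keys mirror the .option()
    calls used throughout this tutorial.
    """
    opts = {
        "driver": "com.tigergraph.jdbc.Driver",
        "url": "jdbc:tg:http://127.0.0.1:14240",
        "user": "tigergraph",
        "password": "tigergraph",
        "graph": "AMLSim",
        "dbtable": dbtable,
        "limit": str(limit),
        "debug": "0",
    }
    # Extra per-read options (e.g. source for edge reads) are stringified too.
    opts.update({key: str(value) for key, value in extra.items()})
    return opts

print(tg_read_options("vertex Transaction")["dbtable"])
# → vertex Transaction
```

With a live session, the dict could then be passed in one call, e.g. spark.read.format("jdbc").options(**tg_read_options("vertex Transaction")).load().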
jdbcDF2 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "edge RECEIVE_TRANSACTION") \
    .option("limit", "1000") \
    .option("source", "9934") \
    .option("debug", "0") \
    .load()

jdbcDF2.show()
The call to read edges is similar. The
dbtable, in this case, is
edge followed by the edge type, for this example
RECEIVE_TRANSACTION. Once again,
limit is the maximum number of edges to return. Finally,
source is the id of the source vertex of the edge. For this example, the source vertex is a Transaction vertex with the id 9934.
There is only one edge connected to Transaction 9934.
jdbcDF3 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "query selectAccountTx(acct=9934)") \
    .option("debug", "0") \
    .load()

jdbcDF3.show()
Finally, to run queries, the
dbtable value must be
query followed by the name of the installed query and the query's parameters. In the example, the query
selectAccountTx is run to find all transactions to and from the account with id 9934.
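The dbtable string for a query is just the query name with its parameters in parentheses. A minimal sketch with a hypothetical formatting helper:

```python
def query_dbtable(name, **params):
    """Format the 'dbtable' value for an installed query.

    Hypothetical helper: builds the "query name(k=v,...)" string
    expected by the dbtable option shown above.
    """
    args = ",".join(f"{key}={value}" for key, value in params.items())
    return f"query {name}({args})"

print(query_dbtable("selectAccountTx", acct=9934))
# → query selectAccountTx(acct=9934)
```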
Congrats! If you made it this far, you can now interact with TigerGraph via pySpark. To find the full code, check out the GitHub repository here.
In addition, check out the TigerGraph-Spark connector documentation here.
Finally, check out the introduction to Spark tutorials here.
From here, you can continue to explore pySpark and TigerGraph to create more complex projects. For example, integrating Spark's MLlib and TigerGraph's Graph Data Science algorithms to detect fraudulent transactions could be an awesome way to couple the strengths of both tools.
Good luck exploring and building projects! If you have any questions or simply want to chat with other developers or show off some of your projects, feel free to join the TigerGraph Discourse and Discord!