Benchmarking MongoSluice: Streaming Yelp’s Business Data to MySQL
MongoSluice’s ultimate goal is to accurately represent NoSQL data from MongoDB in SQL form for simple analysis. To generate a complete representation of the data, it has to check every single document within a collection; that is the only way to be 100% sure that all the correct fields are generated. It is then up to developers and data analysts to decide which data is relevant.
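To illustrate why a full scan matters, here is a minimal Python sketch of schema inference over a set of documents. It is not MongoSluice's actual implementation; the function name `infer_schema` and the two sample documents are assumptions made up for this example. The idea is simply that a field (or a second data type for a field) may appear in only one document, so skipping any document risks missing it.

```python
from collections import defaultdict


def infer_schema(documents):
    """Scan every document and map each field path to the set of type names seen.

    Hypothetical helper for illustration only; nested objects are flattened
    into dotted paths, and lists are treated as a single leaf value.
    """
    schema = defaultdict(set)

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            schema[path].add(type(value).__name__)

    for doc in documents:
        walk(doc, "")
    return dict(schema)


# Made-up sample documents, loosely shaped like business records.
sample_docs = [
    {"name": "Cafe A", "stars": 4.5, "attributes": {"WiFi": "free"}},
    {"name": "Cafe B", "stars": 4, "hours": {"Monday": "9:00-17:00"}},
]

schema = infer_schema(sample_docs)
for field, types in sorted(schema.items()):
    print(field, sorted(types))
```

In this sketch, the `hours.Monday` field and the integer variant of `stars` only show up in the second document, which is the kind of detail a sampled scan could miss.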
With this high degree of accuracy, speed is not the first priority. However, given the complexity of its tasks, MongoSluice performs rather well.
To test MongoSluice’s speed, we used yelp_dataset_business.json, a 140 MB JSON dataset provided by Yelp that consisted of 188,593 documents in MongoDB.
The Hardware and Software
Here are the specs of our hardware, each running as a separate DigitalOcean Droplet:
- The MongoDB Droplet: Ubuntu 4.0.2; 16 GB Memory; 6 vCPUs; and 320 GB of disk space
- The MySQL Droplet: Ubuntu 4.0.2; 4 GB Memory; 2 vCPUs; and 80 GB of disk space
- The MongoSluice Droplet: Ubuntu 4.0.2; 4 GB Memory; 2 vCPUs; and 80 GB of disk space
Here is the time that MongoSluice took to process the data:
- Total Time: 39 minutes
- Generating schema: 17 minutes
- Streaming data: 22 minutes
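For a rough sense of throughput, the timings above can be converted to documents per second. The short Python sketch below uses only the figures reported in this post (188,593 documents, 17 and 22 minutes); the variable and function names are made up for the example, and the results are back-of-the-envelope averages rather than measured rates.

```python
DOCUMENTS = 188_593  # document count reported for the Yelp business dataset

# Phase durations in minutes, as reported above.
phases_minutes = {"Generating schema": 17, "Streaming data": 22}
total_minutes = sum(phases_minutes.values())


def docs_per_second(documents, minutes):
    """Average documents processed per second over a phase of the given length."""
    return documents / (minutes * 60)


for phase, minutes in phases_minutes.items():
    print(f"{phase}: {docs_per_second(DOCUMENTS, minutes):.1f} docs/s")
print(f"Total ({total_minutes} min): {docs_per_second(DOCUMENTS, total_minutes):.1f} docs/s")
```

Averaged over the whole run, this works out to roughly 80 documents per second end to end.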
Here is a look at the schema in MySQL Workbench: