Evaluating AWS Glue for Continuous CSV-to-JSON Conversion
MTV News, in collaboration with Cirit, hosts an election results application. Both companies are committed to making the service more efficient, particularly the conversion of CSV files to JSON, which is critical for delivering timely updates during the Finnish elections. To speed up this conversion, we evaluated AWS Glue as a potential solution.
The Promise and Reality
AWS Glue is a fully managed ETL (Extract, Transform, Load) service from Amazon Web Services, designed to streamline data preparation and transformation tasks. It offers a graphical interface intended to simplify the creation of ETL workflows, and we’ve noticed it works well for ad-hoc, one-time conversions.
However, the reality of using AWS Glue falls short when dealing with non-standard use cases or performance-critical tasks.
While the UI is easy to navigate, it offers limited room for customization, which makes complex tasks cumbersome. Moreover, in our experience jobs failed frequently, often because of IAM role errors, and the data preview feature was quite slow. For use cases requiring near real-time processing, these issues are serious obstacles.
Performance Limitations
AWS Glue struggles especially with large datasets and with input in character encodings such as ISO-8859-1, which must be converted to UTF-8 before processing. Default performance can be disappointingly slow, with tasks taking minutes to initiate and execute. These delays are problematic for near real-time data processing, where every second counts.
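To make the transformation concrete, here is a minimal Python sketch of the conversion itself, outside Glue: it decodes ISO-8859-1 CSV bytes and emits JSON with non-ASCII characters preserved. The semicolon delimiter, the column names, and the function name `csv_bytes_to_json` are assumptions for illustration, not the actual election data format or our production code.

```python
import csv
import io
import json


def csv_bytes_to_json(raw: bytes, delimiter: str = ";") -> str:
    """Decode ISO-8859-1 CSV bytes and return a JSON array of row objects.

    The delimiter default is an assumption; adjust it to the real feed.
    """
    # Every byte sequence is valid ISO-8859-1, so this decode cannot fail.
    text = raw.decode("iso-8859-1")
    # newline="" leaves line endings untouched so the csv module handles them.
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    # ensure_ascii=False keeps characters such as "ä" readable in the output.
    return json.dumps(list(reader), ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical sample row; all values stay strings, as csv does no typing.
    sample = "name;votes\nJärvenpää;1200\n".encode("iso-8859-1")
    print(csv_bytes_to_json(sample))
```

Writing the returned string to a file or object store with UTF-8 encoding completes the ISO-8859-1 to UTF-8 step described above.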
Despite these challenges, AWS Glue has notable strengths that make it well-suited for use cases where data processing jobs require extensive transformations and can tolerate longer execution times. For workflows where the ETL process traditionally takes hours, the ability to reduce these times to just 10-20 minutes represents a significant improvement. This capability is particularly valuable for batch processing tasks, data warehousing, and preparing large datasets for analytics, where real-time processing is not as critical.
Conclusion: Is AWS Glue Suitable for Near Real-Time Data Processing?
After our evaluation, we deemed AWS Glue too slow for our election results service, where new data arrives every five minutes and conversion must finish within a few tens of seconds. The processing delays introduced by Glue made it unsuitable for our needs, leading us to explore alternative solutions. While Glue is a valuable tool for broader data processing strategies, it may not be the best choice for use cases demanding fast and continuous data processing.
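The timing requirement above can be checked with a small harness like the following sketch. It times any conversion callable and reports whether it stayed within a budget. The 30-second default is an assumption standing in for "a few tens of seconds", and `run_within_budget` is a hypothetical helper name.

```python
import time
from typing import Callable

# Assumption: "a few tens of seconds" is taken as 30 s for illustration.
BUDGET_S = 30.0


def run_within_budget(
    convert: Callable[[], object], budget_s: float = BUDGET_S
) -> tuple[float, bool]:
    """Run convert() once; return (elapsed seconds, whether it met the budget)."""
    start = time.perf_counter()
    convert()
    elapsed = time.perf_counter() - start
    return elapsed, elapsed <= budget_s


if __name__ == "__main__":
    # Stand-in workload; replace with the real conversion step.
    elapsed, ok = run_within_budget(lambda: sum(range(1_000_000)))
    print(f"elapsed={elapsed:.3f}s within_budget={ok}")
```

A measurement like this should include job startup time, since for Glue the delay before execution was a large part of the problem.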