Show HN: Greenmask 0.2 – Database anonymization tool

github.com

63 points by woyten 13 hours ago

Hi! My name is Vadim, and I’m the developer of Greenmask (https://github.com/GreenmaskIO/greenmask). Greenmask is now almost a year old, and we recently published one of our most significant releases yet, with new features: https://github.com/GreenmaskIO/greenmask/releases/tag/v0.2.0, along with a new website at https://greenmask.io.

Before I describe Greenmask’s features, I want to share the story of how and why I started implementing it.

Everyone strives to have their staging environment resemble production as closely as possible, because it plays a critical role in ensuring quality and improving metrics like time to delivery. To achieve this, many teams have started copying production databases and data into their staging environments. Obviously, this requires anonymizing the data first, and for that people use either custom scripts or existing anonymization software.

Having worked as a database engineer for 8 years, I frequently struggled with routine tasks like setting up development environments, which was a common request. Initially I used custom scripts to handle this, but things became increasingly complex as the number of services grew, especially with the rise of microservices architecture.

When I began exploring tools to solve this issue, I listed my key expectations for such software:

* Documentation

* Type safety (the tool should validate any changes to the data)

* Streaming (the ability to apply transformations while the data is streaming)

* Consistency (transformations must maintain constraints, functional dependencies, and more)

* Reliability

* Customizability

* Interactivity and usability

* Simplicity

I found a few options, but none fully met my expectations. Two interesting tools I discovered were pganonymizer and Replibyte. I liked the architecture of Replibyte on paper, but when I tried it, it failed due to limitations of that same architecture.

With these thoughts in mind, I began developing Greenmask in mid-2023. My goal was to create a tool that meets all of these requirements, based on the design principles I laid out. Here are some key highlights:

* It is a single utility - Greenmask delegates the schema dump to vendor utilities and takes responsibility only for data dumps and transformations.

* Database Subset (https://docs.greenmask.io/latest/database_subset) - specify a subset condition and scale the dump size down. We did deep research into graph algorithms, and Greenmask can now subset databases of almost any complexity (a toy sketch of the traversal idea follows this list).

* Database type safety - it uses the DBMS driver to decode and encode data into real types (such as int, float, etc.) in the stream. This guarantees consistency and almost eliminates the chance of corrupted dumps.

* Deterministic engine (https://docs.greenmask.io/latest/built_in_transformers/trans...) - generate data using a hash engine that produces consistent output for the same input.

* Dynamic parameters for transformers (https://docs.greenmask.io/latest/built_in_transformers/dynam...) - imagine created_at and updated_at columns with a functional dependency between them (updated_at must not precede created_at). Dynamic parameters ensure these dates are generated correctly; the second sketch after this list illustrates the idea together with the deterministic engine.
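
To make the subset point concrete, here is a toy Go sketch of the traversal idea: treat rows as nodes and foreign keys as edges, then walk breadth-first from the rows matched by the subset condition so that nothing the kept rows reference is left out of the dump. This is my illustration of the general technique, not Greenmask's actual implementation; the Row type and the refs callback are hypothetical.

    package main

    import "fmt"

    // Row identifies a single row; subsetting keeps the foreign-key closure
    // of the starting rows so the dump never contains dangling references.
    type Row struct{ Table, PK string }

    // subset walks the foreign-key graph breadth-first from the rows matched
    // by the subset condition and returns every row that must be kept.
    func subset(start []Row, refs func(Row) []Row) map[Row]bool {
        keep := map[Row]bool{}
        queue := append([]Row{}, start...)
        for len(queue) > 0 {
            r := queue[0]
            queue = queue[1:]
            if keep[r] {
                continue
            }
            keep[r] = true
            queue = append(queue, refs(r)...) // rows r points to via FKs
        }
        return keep
    }

    func main() {
        // Toy schema: orders reference users; users reference nothing.
        refs := func(r Row) []Row {
            if r.Table == "orders" {
                return []Row{{"users", "u1"}}
            }
            return nil
        }
        fmt.Println(subset([]Row{{"orders", "o1"}}, refs)) // both rows kept
    }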
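
And here is a minimal Go sketch of the ideas behind the deterministic engine and dynamic parameters, assuming a keyed hash drives both (that is the general technique; the function names, the per-project secret, and the 30-day cap are hypothetical, not Greenmask's actual code):

    package main

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/binary"
        "fmt"
        "time"
    )

    // mask derives a stable digest from an input with a keyed hash: the same
    // (secret, input) pair yields the same bytes on every run, which is the
    // essence of a deterministic ("hash engine") transformer.
    func mask(secret []byte, input string) []byte {
        mac := hmac.New(sha256.New, secret)
        mac.Write([]byte(input))
        return mac.Sum(nil)
    }

    // maskEmail maps a real address to a stable fake one.
    func maskEmail(secret []byte, email string) string {
        return fmt.Sprintf("user_%x@example.com", mask(secret, email)[:6])
    }

    // shiftPair applies one offset, derived from the row key, to both
    // timestamps, so the dependency updated >= created survives masking.
    func shiftPair(secret []byte, rowKey string, created, updated time.Time) (time.Time, time.Time) {
        n := binary.BigEndian.Uint64(mask(secret, rowKey)[:8])
        offset := time.Duration(n%uint64(30*24*time.Hour/time.Second)) * time.Second // at most ~30 days
        return created.Add(-offset), updated.Add(-offset)
    }

    func main() {
        secret := []byte("per-project secret")
        fmt.Println(maskEmail(secret, "alice@example.org")) // identical on every run

        created := time.Date(2024, 1, 10, 12, 0, 0, 0, time.UTC)
        c, u := shiftPair(secret, "row-42", created, created.Add(48*time.Hour))
        fmt.Println(c, u, u.After(c)) // ordering survives the transformation
    }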

We are actively maintaining the project and continuously improving it; our public roadmap is at https://github.com/orgs/GreenmaskIO/projects/6. Soon we will release a Python library along with transformation collections to help users develop their own complex transformations and integrate them with any service. We plan to support more database engines, with MySQL being the next one, and we are working on tools that will integrate seamlessly with your CI/CD systems.

To get started, we’ve prepared a playground for you that can be easily set up using Docker Compose: https://docs.greenmask.io/latest/playground/

I’d love to hear any questions or feedback from you!

brody_slade_ai 3 hours ago

I've had my fair share of struggles with data anonymization. I've tried various techniques, from simple data masking to more complex approaches like pseudonymization and generalization.

However, I've found that these traditional methods often fall short in preserving the intricate relationships and structures within the data. That's why I was excited to discover synthetic data generation, which has been a game-changer for me. By creating artificial data that mirrors the statistical properties of the original, I've been able to share data with others without compromising sensitive information. I've used synthetic data generation for various projects and it's been a valuable tool in my toolkit.

  • ldng 39 minutes ago

    Any generation tool you suggest checking out?

btown 11 hours ago

This is really awesome - and it's so amazing that you've built this as a standalone tool!

I can absolutely speak to the pain of having a dozen pg_dump --exclude-table-data arguments and having a developer experience that makes it difficult to reproduce bugs due to drift between production data and test fixtures (even if they share the same schema, assumptions can change massively!).

Secure and robust database cloning also enables preview apps that actually answer the stakeholder question "can I see/play with what the new code would do, if applied to the actual [document/record/product listing] that motivated the feature/bugfix?" Subsetting and PII masking are both critical for this, and it's amazing to see that you've thought about them as integral parts of the same product.

I really want to see a product like this succeed! The easier the tool is to use, the harder it might be to monetize... but there are so many applications of a tool like this, including ones that can materially improve security at organizations large and small (https://nabeelqu.substack.com/i/150188028/secrets just posted here earlier today remarks on this!) that I'm sure you'll find the right niche!

muhehe 2 hours ago

I liked a similar tool, Snaplet, but unfortunately it's dead now. One thing I liked was the option to run a proxy that you could connect to with any tool you like (psql, DBeaver, ...) and see a preview of your transformations. They also had some good (stable) generators for names, emails, etc. (I haven't yet checked this fully in Greenmask).

Anyway, I will definitely try this. It looks really good!

imiric 10 hours ago

It's great seeing more tools in this space.

I was recently researching ways of anonymizing production data for staging, and I also found existing tools either cumbersome to set up or lacking in features.

I stumbled upon clickhouse-obfuscator[1], and really liked that it worked on standalone dump formats (CSV, Parquet, etc.) rather than any specific DBMS. I think that's a great approach for this, since it keeps things simple and generic, and it can be conveniently added as a middle step in the backup-restore pipeline. Unfortunately, the tool is quite barebones, and has issues maintaining referential integrity, so we had to abandon it.

This is still an unsolved problem in our team, so I'll keep an eye on your tool. We would need support for ClickHouse as well, so it's good you're planning support for other DBMSs. Good luck!

[1]: https://clickhouse.com/docs/en/operations/utilities/clickhou...

gregwebs 4 hours ago

Congrats on the release! I should be able to switch from datanymizer (unmaintained) now.

The other tool in this space to look at is neosync: https://www.neosync.dev/

  • edrenova 4 hours ago

    Thanks for the shout-out! Co-founder of Neosync here - love seeing more tools in this space and pushing the envelope further. Good luck!

jensenbox 9 hours ago

Having already jumped from Replibyte to Greenmask, I can say it has a significantly better architecture - hands down.