diff --git a/docs/en/images/batch-create-datasource-failed.png b/docs/en/images/batch-create-datasource-failed.png new file mode 100644 index 00000000..e2137497 Binary files /dev/null and b/docs/en/images/batch-create-datasource-failed.png differ diff --git a/docs/en/images/batch-create-datasource-succeeded.png b/docs/en/images/batch-create-datasource-succeeded.png new file mode 100644 index 00000000..6ca5de16 Binary files /dev/null and b/docs/en/images/batch-create-datasource-succeeded.png differ diff --git a/docs/en/images/batch_create_datasource_enablemarcos.png b/docs/en/images/batch_create_datasource_enablemarcos.png new file mode 100644 index 00000000..38fcc31c Binary files /dev/null and b/docs/en/images/batch_create_datasource_enablemarcos.png differ diff --git a/docs/en/user-guide/data-catalog-create-jdbc-database-proxy.md b/docs/en/user-guide/data-catalog-create-jdbc-database-proxy.md deleted file mode 100644 index 7c5d3cf7..00000000 --- a/docs/en/user-guide/data-catalog-create-jdbc-database-proxy.md +++ /dev/null @@ -1,35 +0,0 @@ -# Connect to Data Sources - JDBC (Database Proxy) -When your RDS/database is in a private network and there are strict IP restrictions (only fixed IPs are allowed for access), you need to connect to the data source in this way. - -### Prerequisites - Maintain Network Connectivity -1. Please ensure when you [add an AWS account](data-source.md), choose the JDBC method, then proceed to [Connect to Data Source - RDS](data-catalog-create-jdbc-database-proxy.md) for operations. -2. Create a Database Proxy: Create an EC2 in the VPC where the solution resides to act as a proxy machine. Refer to the steps in: [Appendix: Creating a Database Proxy](appendix-database-proxy.md). -3. Add RDS to the whitelist: Add the EC2 IP to the Inbound Rule of the Security Group for the database to be scanned. - -## Connect to the Database Through EC2 Database Proxy (DB Proxy) -1. In the left menu, select **Connect Data Source** -2. Choose the cloud account you wish to scan, click to enter the account, and open the details page -3. Click to enter a cloud account, open the details page -4. Select the **Custom Database (JDBC)** tab -5. Click **Action**, **Add Data Source** - - | Parameter | Required | Description | - |------------------|----------|--------------------------------------------------------------------------------------------------| - | Instance Name | Yes | Database instance name | - | Check SSL Connection | No | Whether to connect via SSL | - | Description (Optional) | No | Instance description | - | JDBC URL (Required) | Yes | Fill in at least one database under the database instance for connection and scanning. Format: `jdbc:mysql://ec2_public_ip:port/databasename` | - | JDBC Databases | No | List all databases in this instance that need to be scanned, including the required database above. Click the button to "Auto Query Database List" | - | Credentials | Yes | Choose username and password or SecretManager. Enter the database username/password. | - | VPC | Yes | Select the VPC where the Proxy resides | - | Subnet | Yes | Select the subnet in the VPC where the Proxy resides | - | Security Group | Yes | Select the security group in the VPC where the Proxy resides | - !!! Info "Auto Retrieve Database Button" - The solution currently supports auto-retrieval of MySQL databases. - -6. Choose the database instance, click the button **Sync to Data Catalog** -7. 
You will see the catalog status turn to gray `PENDING`, indicating the connection is starting (about 3 minutes) -8. You will see the catalog status turn to blue `CRAWLING` (about 15 minutes for 200 tables) -9. When you see the catalog status turn to green `ACTIVE`, it means a data catalog has been created for the RDS instance. At this point, you can click on the corresponding data catalog link for a preliminary view and subsequent scanning tasks. - -At this point, you have established a connection to the RDS proxy's data source via JDBC and can proceed to the next steps. diff --git a/docs/en/user-guide/data-catalog-create-jdbc-redshift.md b/docs/en/user-guide/data-catalog-create-jdbc-redshift.md deleted file mode 100644 index e2650b9c..00000000 --- a/docs/en/user-guide/data-catalog-create-jdbc-redshift.md +++ /dev/null @@ -1,37 +0,0 @@ -# Connect to Data Sources - JDBC (Redshift) - -When you wish to perform a sensitive data scan on a Redshift Cluster, you can use Redshift's database as a data source. - -### Prerequisites - Maintain Network Connectivity -1. Please confirm that when you [add an AWS account](data-source.md), you choose the CloudFormation method. If you added an account using the JDBC method, please proceed to [Connect via EC2 Proxy](data-catalog-create-jdbc-rds-proxy.md) for operations. -2. Prepare Redshift connection credentials (username/password). - -!!! Info "How to Obtain Redshift Credentials" - DBAs or business teams create a read-only user for security audits. This user only needs read-only permissions: `GRANT SHOW VIEW, SELECT ON *.* TO 'reader'@'%'`. - -## Connect to Amazon Redshift Data Source -1. From the left menu, select **Connect Data Source** -2. Choose the **AWS Cloud** tab -3. Click to enter an AWS account and open its detail page -4. Select the **Custom Database (JDBC)** tab -5. Click **Action**, **Add Data Source** -6. In the popup window, enter Redshift credential information. (If you choose the Secret Manager method, you need to manage the username/password for this Redshift in Secret Manager in advance.) - - | Parameter | Required | Description | - |------------------|----------|-------------------------------------------------------------------------------------------------| - | Instance Name | Yes | Name of the database in the cluster | - | Check SSL Connection | No | Whether to connect via SSL | - | Description (Optional) | No | Description | - | JDBC URL (Required) | Yes | Fill in a Redshift database for connection and scanning. Format: `jdbc:redshift://url:port/databasename`. For example: `jdbc:redshift://sdp-uat-redshift.xxxxxxxxxx.us-east-1.redshift.amazonaws.com.cn:5439/dev`| - | JDBC Databases | No | Keep empty | - | Credentials | Yes | Choose username and password or SecretManager. Fill in the database username/password. The credentials can be obtained by a read-only user created by the DBA or business teams for the security team. This user only needs SELECT (read-only) permissions. | - | VPC | Yes | Select the VPC where Redshift is located | - | Subnet | Yes | Select the subnet in the VPC where Redshift is located | - | Security Group | Yes | Select the security group in the VPC where Redshift is located | - -7. Click **Connect**. You can wait 10s before closing this window. -8. You will see the catalog status turn to gray `PENDING`, indicating the connection is starting (about 3 minutes) -9. You will see the catalog status turn to blue `CRAWLING`. (about 15 minutes for 200 tables) -10. 
When you see the catalog status turn green `ACTIVE`, it means a data catalog has been created for the Redshift Cluster. - -At this point, you have successfully connected the Redshift data source and can proceed to the next step 👉 [Define Classification and Grading Templates](data-identifiers.md). diff --git a/docs/en/user-guide/data-catalog-create-jdbc.md b/docs/en/user-guide/data-catalog-create-jdbc.md index dd0ca137..a0c06407 100644 --- a/docs/en/user-guide/data-catalog-create-jdbc.md +++ b/docs/en/user-guide/data-catalog-create-jdbc.md @@ -1,54 +1,97 @@ # Connect to Data Sources - JDBC -When you want to scan a specific type of database for sensitive data, you can use the DB instance or databases as a data source. - -| Supported Database Types | -|---------------------------------| -| Amazon Redshift | -| Amazon Aurora | -| Microsoft SQL Server | -| MySQL | -| Oracle | -| PostgreSQL | -| Snowflake | -| Amazon RDS for MariaDB | - -### Prerequisites - Maintain Network Connectivity -1. Please confirm that when you [add an AWS account](data-source.md), you choose the CloudFormation method. If you added the account using the JDBC method, please go to [Connect via EC2 Proxy](data-catalog-create-jdbc-db-proxy.md) for operations. -2. Ensure that the inbound rule of the database to be scanned includes a self-reference of its security group. See [official documentation](https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html) for details. -3. Prepare Redshift connection credentials (username/password). - -!!! Info "How to Obtain Database Credentials" - DBAs or business teams create a read-only user for security audits. This user only needs read-only permissions: `GRANT SHOW VIEW, SELECT ON *.* TO 'reader'@'%'` - -## Connect to Amazon Redshift Data Source -1. From the left menu, select **Connect Data Source** -2. Choose the **AWS Cloud** tab -3. Click to enter an AWS account and open its detail page -4. Select the **Custom Database (JDBC)** tab -5. Click **Action**, **Add Data Source** -6. In the popup window, enter Redshift credential information. (If you choose Secret Manager, you need to manage the username/password for this Redshift in Secret Manager in advance.) - - | Parameter | Required | Description | - |------------------|----------|-------------------------------------------------------------------------------------------------| - | Instance Name | Yes | Database name | - | Check SSL Connection | No | Whether to connect via SSL | - | Description (Optional) | No | Instance description | - | JDBC URL (Required) | Yes | Fill in a database for connection and scanning. See the table below for the specific format. | - | JDBC Databases | No | If you want to display multiple databases in a data catalog, fill in the database list. For example, for one data catalog per database instance, you can fill in multiple databases under the instance. If you only want to scan one database under this instance, leave it empty. | - | Credentials | Yes | Choose username and password or SecretManager. Enter the database username/password. | - | VPC | Yes | Select the VPC where the database is located | - | Subnet | Yes | Select the subnet in the VPC where the database is located | - | Security Group | Yes | Select the security group in the VPC where the database is located | - -7. Click **Connect**. You can wait 10 seconds before closing this window. -8. You will see the catalog status turn to gray `PENDING`, indicating the connection is starting (about 3 minutes). -9. 
You will see the catalog status turn to blue `CRAWLING`. (about 15 minutes for 200 tables)
-10. When you see the catalog status turn green `ACTIVE`, it means a data catalog has been created for the Redshift Cluster.
-
-At this point, you have successfully connected the Redshift data source and can proceed to the next step 👉 [Define Classification and Grading Templates](data-identifiers.md).
-
-!!! Info "JDBC URL Formats and Examples"
+When you wish to perform sensitive data scans on a particular type of database, you can use DB instances or databases as your data sources.
+
+First, please ensure that when you [add an AWS account](data-source.md), you select the CloudFormation method. If you added the account using the JDBC method, please proceed to the [Connect to Data Sources via Database Proxy](#connect-to-data-sources-via-database-proxy) section below.
+
+Currently supported JDBC data sources:
+
+| Supported Database Types |
+|--------------------------|
+| Amazon Redshift |
+| Amazon Aurora |
+| Microsoft SQL Server |
+| MySQL |
+| Oracle |
+| PostgreSQL |
+| Snowflake |
+| Amazon RDS for MariaDB |
+
+### Prerequisites - Ensure Network Connectivity
+
+1. Please ensure that the inbound rule of the database you want to scan includes a self-reference to its security group. For detailed steps, refer to the [official documentation](https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html).
+2. Have the database connection credentials ready (username/password).
+
+!!! Info "How to get JDBC Credentials"
+    Have your DBA or the business unit create a read-only user for security auditing, and grant it read-only permissions: `GRANT SHOW VIEW, SELECT ON *.* TO 'reader'@'%';`
+
+## Connect to a Single JDBC Data Source
+1. From the left menu, select **Connect Data Source**.
+2. Choose the **AWS Cloud** tab.
+3. Click on an AWS account to open its detail page.
+4. Select the **Custom Database (JDBC)** tab.
+5. Click **Actions**, **Add Data Source**.
+6. In the pop-up window, enter the database credential information. (If you choose the Secret Manager method, you need to host the username/password in Secret Manager beforehand.)
+
+    | Parameter | Required | Parameter Description |
+    |--------------------|----------|--------------------------------------------------------------------------------------------------------------------|
+    | Instance Name | Yes | Database name |
+    | Enable SSL | No | Whether to connect via SSL |
+    | Description (Optional) | No | Instance description |
+    | Database Type | Yes | Choose MySQL or Other. If MySQL is selected, the solution can automatically query the databases in the instance; otherwise, you need to add the database list manually. |
+    | JDBC URL (Required) | Yes | Fill in one database to connect and scan. For the specific format, see the "JDBC URL Format and Examples" section at the bottom of this article. |
+    | JDBC Databases | No | If you want one data catalog to display multiple databases, enter the list of databases. For example, if one data catalog corresponds to one database instance, you can enter multiple databases under the instance. If you only want to scan a single database under this instance, keep it blank. |
+    | Credentials | Yes | Choose username/password or SecretManager, and fill in the database username/password. |
+    | VPC | Yes | Select the VPC where the database is located |
+    | Subnet | Yes | Select the VPC subnet where the database is located |
+    | Security Group | Yes | Select the VPC security group where the database is located |
+
+7. Click **Authorize**. 
You can close this window after about 10 seconds.
+8. You will see the catalog status change to blue `AUTHORIZED`. This also means that, in the SDP backend, AWS Glue has successfully created a Crawler.
+
+**You have now connected to this data source via JDBC 🎉. You can proceed to the next step to [Define Classification and Grading Templates](data-identifiers.md).**
+
+Once you have configured the classification template and run a sensitive data discovery job:
+
+- If the job succeeds: the catalog status on this data source page turns green `ACTIVE`, indicating that a data catalog has been created for this data source.
+- If the job fails: the catalog status on this data source page turns gray with an error message; hover over the error to see the details.
+
+## Bulk Automatic Creation of JDBC Data Sources
+
+If you have many data sources, adding them one by one in the UI can be inconvenient; in that case, use the bulk creation feature.
+
+### Step 1: Download Template
+On the AWS account management page, click the **Bulk Create** button.
+On the bulk operation page, first download the "Bulk Create Data Sources" template (.xlsm).
+
+### Step 2: Edit the Template File
+Open this file with Microsoft Excel. When Excel asks whether to enable macros, choose **Enable Macros**.
+![edit-icon](docs/../../images/batch_create_datasource_enablemarcos.png)
+
+Enter the data sources you need to scan. Small batches are recommended, as they make errors easier to track down.
+
+| Instance Name | SSL | Description | JDBC URL | JDBC Databases | SecretARN | Username | Password | AccountID | Region | ProviderID |
+|---------------------|-----|--------------------------|---------------------------------------------------------|----------------|-----------|----------|------------|---------------|----------------|------------|
+| test-instance-7001 | 1 | xxxx1.sql.db.com:23297 | jdbc:mysql://172.31.48.6:7001 | | | root | Temp123456! | 123456789 | ap-guangzhou-1 | 1 |
+| test-instance-7002 | 1 | xxxx2.sql.db.com:3306 | jdbc:mysql://172.31.48.6:7002 | | | root | Temp123456! | 123456789 | ap-guangzhou-1 | 1 |
+
+
+## Connect to Data Sources via Database Proxy
+
+When your RDS/database is in a private network with strict IP restrictions (only fixed IPs are allowed to connect), you need to connect to data sources this way.
+
+1. Create a database proxy: create an EC2 instance as a proxy machine in the VPC where the solution is located. For detailed steps, see [Appendix: Create and Configure Database Proxy](appendix-database-proxy.md).
+2. Configure Nginx forwarding on the proxy. For detailed steps, see [Appendix: Create and Configure Database Proxy](appendix-database-proxy.md).
+3. When creating the JDBC data source:
+    - For the Description field, it is recommended to fill in the actual database address.
+    - For the JDBC URL field, fill in `jdbc:mysql://ec2_public_ip:port/databasename`.
+    - Fill in the Provider field with 4. 
(Required for the bulk creation template)
+
+---
+
+### Parameters for Creating Data Sources
+**JDBC URL Format and Examples**
 
 | JDBC URL                                        | Example                                                                                        |
 |-------------------------------------------------|----------------------------------------------------------------------------------------------|
@@ -61,3 +104,13 @@ At this point, you have successfully connected the Redshift data source and can
 | Amazon RDS for MariaDB                          | `jdbc:mysql://xxx-cluster.cluster-xxx.aws-region.rds.amazonaws.com:3306/employee`              |
 | Snowflake (Standard Connection)                 | `jdbc:snowflake://account_name.snowflakecomputing.com/?user=user_name&db=sample&role=role_name&warehouse=warehouse_name` |
 | Snowflake (AWS PrivateLink Connection)          | `jdbc:snowflake://account_name.region.privatelink.snowflakecomputing.com/?user=user_name&db=sample&role=role_name&warehouse=warehouse_name` |
+
+
+Provider Parameter (used for bulk creation):
+
+| Provider | Provider Id | Description |
+|------------|-------------|-----------------------------------|
+| AWS | 1 | AWS (installation method: CloudFormation) |
+| Tencent | 2 | Tencent account |
+| Google | 3 | Google account |
+| AWS (JDBC Only) | 4 | AWS (installation method: JDBC Only) |
diff --git a/docs/en/user-guide/discovery-job-create.md b/docs/en/user-guide/discovery-job-create.md
index c7ba5b72..58ca5ff0 100644
--- a/docs/en/user-guide/discovery-job-create.md
+++ b/docs/en/user-guide/discovery-job-create.md
@@ -1,47 +1,51 @@
-You can create and manage jobs to detect sensitive data. Discovery jobs consist of one or more AWS Glue jobs for actual data detection. For more information, see [Viewing Job Details](discovery-job-details.md).
+You can create and manage jobs for detecting sensitive data. Discovery jobs consist of one or more AWS Glue jobs that perform the actual data detection. For more information, see [View Job Details](discovery-job-details.md).
 
-## Create Discovery Jobs
+## Create Discovery Job
 
-1. In the left menu, select **Run Sensitive Data Discovery Job**.
-2. Choose **Create Sensitive Data Discovery Job**.
-![edit-icon](docs/../../images/job-list-cn.png)
+From the left menu, select **Run Sensitive Data Discovery Job**, then click **Create Sensitive Data Discovery Job**.
 
- - **Step 1: Select Data Source**
+**Step 1**: Choose Data Source
 
- | Provider | Data source |
- |----------|-------------|
- | AWS | S3, RDS, Glue, JDBC |
- | Tencent | JDBC |
- | Google | JDBC |
+| Provider | Data source |
+|----------------|--------------------|
+| AWS | S3, RDS, Glue, Custom databases, Proxy databases |
+| Tencent | JDBC |
+| Google | JDBC |
 
- - **Step 2: Job Settings**
+!!! Info "What are AWS CustomDB and ProxyDB?"
+    - If you are scanning within the solution's own account and connecting to a JDBC data source, select **Custom databases**.
+    - If you added the account using CloudFormation and connected a JDBC data source, select **Custom databases**.
+    - If you added the account using JDBC Only and connected a JDBC data source, select **Proxy databases**.
 
- | Job Setting | Description | Options |
- |--------------------------|-------------|---------|
- | Scan Frequency | Indicates the scan frequency of the discovery job. | On-demand
Daily
Weekly
Monthly | - | Scan Depth | Indicates the number of sample rows. | 100 (Recommended)
10, 30, 60, 100, 300, 500, 1000 | - | Scan Depth - Unstructured Data | Applies only to S3, samples the number of unstructured files in different folders. | Can Skip, 10 files, 30 files, All files | - | Scan Scope | Defines the overall scan scope of the target data source.
"Full Scan" scans all target data sources.
"Incremental Scan" skips data sources unchanged since the last data catalog update. | Full Scan
Incremental Scan (Recommended) | - | Detection Threshold | Defines the job's tolerance level required. If the scan depth is 1000 rows, a 10% threshold means that if more than 100 rows (out of 1000) match the identifier rules, the column will be marked as sensitive. A lower threshold indicates lower tolerance for sensitive data. | 10% (Recommended)
20%
30%
40%
50%
100% | - | Override Manual Privacy Labels | Choose whether to allow this job to use job results to override privacy labels in the data catalog. | Do Not Override (Recommended)
Override |
 
- - **Step 3: Advanced Configuration**
-
- - **Step 4: Job Preview**
 
-3. After previewing the job, select **Run Job**.
+**Step 2**: Select the specific databases to scan
+
+**Step 3**: Job Settings
+
+| Job Setting | Description | Options |
+| --- | --- | --- |
+| Scan Frequency | Indicates the scan frequency of the discovery job. | On-demand
Daily
Weekly
Monthly | +| Sampling Depth | Indicates the number of sampled rows. | 100 (Recommended)
10, 30, 60, 100, 300, 500, 1000 |
+| Sampling Depth - Unstructured Data | Applies only to S3; the number of unstructured files sampled per folder. | Skip, 10 files, 30 files, All files |
+| Scan Scope | Defines the overall scan scope of the target data source.
"Comprehensive Scan" means scanning all target data sources.
"Incremental Scan" means skipping data sources unchanged since the last data catalog update. | Comprehensive Scan
Incremental Scan (Recommended) |
+| Detection Threshold | Defines the tolerance level required for the job. If the sampling depth is 1000 rows, a threshold of 10% means that if more than 100 rows (out of 1000) match the identifier rules, the column will be flagged as sensitive. A lower threshold indicates a lower tolerance for sensitive data. | 10% (Recommended)
20%
30%
40%
50%
100% |
+| Overwrite Manually Updated Privacy Tags | Choose whether to allow the job to overwrite manually updated privacy tags in the data catalog with job results. | Do Not Overwrite (Recommended)
Overwrite |
 
-### About Incremental Scanning:
-When the "Incremental Scan" setting is selected in the job, the scanning logic for S3 and RDS is slightly different, as follows:
-
-S3: When there is any change to an S3 object, the incremental scan will scan the Folder level of that path.
-
-- For example: there is 1 bucket with 3 folders, each containing a CSV file with a different schema. When the schema of the files in 1 folder is changed, during incremental scanning, the job will only scan the CSV files in that folder, not the other 2 folders.
-
-- For example: there is 1 bucket with 3 folders, each containing a CSV file with a different schema. When the schema of the files in 1 folder remains the same but the number of rows is increased or the file is updated in any way, during incremental scanning, the job will only scan the CSV files in that folder, not the other 2 folders.
-
-RDS: Only when there is a column-level change to an RDS table will the incremental scan scan that table.
-
-- For example: there is 1 RDS instance with 3 tables. When the schema of 1 table is changed (a column is added or deleted), during incremental scanning, only that table will be scanned, and the other two tables will be skipped.
-
-- For example: there is 1 RDS instance with 3 tables. When the schema of 1 table remains the same but rows are added or deleted, during incremental scanning, none of the 3 tables will be scanned.
\ No newline at end of file
+**Step 4**: Advanced Configuration
+
+**Step 5**: Job Preview
+    After previewing the job, select **Run Job**.
+
+---
+
+### About Incremental Scan:
+When "Incremental Scan" is chosen in the job settings, the scan logic differs slightly between S3 and RDS, as follows:
+
+S3: When any change occurs in an S3 object, the incremental scan scans that path at the folder level.
+
+- Example: With 1 bucket and 3 folders, each containing a CSV file (with a different schema), if the schema of one folder's file changes, the job will only scan the CSV files in that folder during incremental scanning, skipping the other 2 folders.
+
+- Example: With 1 bucket and 3 folders, each containing a CSV file (with a different schema), if there are no schema changes but rows are added or any other file update occurs in one folder, the job will only scan the CSV files in that folder during incremental scanning, skipping the other 2 folders.
+
+RDS: Only when there is a column-level change in an RDS table will the incremental scan rescan that table.
+
+- Example: With 1 RDS instance and 3 tables, if the schema of one table changes (a column is added or deleted), the job will only scan that table during incremental scanning, skipping the other 2 tables.
+- Example: With 1 RDS instance and 3 tables, if there are no schema changes but rows are added/deleted, none of the 3 tables will be scanned during incremental scanning. 
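+
+Since discovery jobs run as AWS Glue jobs behind the scenes, you can also watch the underlying Glue job runs while a discovery job is in progress. Below is a minimal boto3 sketch, not a feature of the solution itself; the region and the idea of filtering by job name are assumptions, since the actual Glue job names depend on your deployment:
+
+```python
+import boto3
+
+# Hypothetical monitoring helper: lists each Glue job in the account together
+# with the state of its most recent run. Adjust region_name, and add a
+# job-name filter matching your deployment, before using it.
+glue = boto3.client("glue", region_name="us-east-1")
+
+paginator = glue.get_paginator("get_jobs")
+for page in paginator.paginate():
+    for job in page["Jobs"]:
+        runs = glue.get_job_runs(JobName=job["Name"], MaxResults=1)["JobRuns"]
+        if runs:
+            print(job["Name"], runs[0]["JobRunState"], runs[0].get("StartedOn"))
+```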
diff --git a/docs/mkdocs.en.yml b/docs/mkdocs.en.yml index d976a412..f1071307 100644 --- a/docs/mkdocs.en.yml +++ b/docs/mkdocs.en.yml @@ -32,8 +32,6 @@ nav: - Connect to RDS: user-guide/data-catalog-create-rds.md - Connect to Glue: user-guide/data-catalog-create-glue.md - Connect to JDBC: user-guide/data-catalog-create-jdbc.md - - Connect to JDBC(Redshift): user-guide/data-catalog-create-jdbc-redshift.md - - Connect to JDBC(RDS Proxy): user-guide/data-catalog-create-jdbc-database-proxy.md - Step2:Define classification template: user-guide/data-identifiers.md - Step3:Run sensitive data discovery jobs: - Create job: user-guide/discovery-job-create.md @@ -50,7 +48,7 @@ nav: - Appx.Permissions of CloudFormation Stacks: user-guide/appendix-permissions.md - Appx.Add accounts via AWS Organization: user-guide/appendix-organization.md - Appx.EU PII identifiers(GDPR reference): user-guide/appendix-build-in-identifiers-eu-gdpr.md - - Appx.Create database proxy: user-guide/appendix-database-proxy.md + - Appx.Create and config database proxy: user-guide/appendix-database-proxy.md - FAQ: faq.md - Troubleshooting: troubleshooting.md - Uninstall the solution: uninstall.md diff --git a/docs/mkdocs.zh.yml b/docs/mkdocs.zh.yml index 9e479d7c..c54f6d1f 100644 --- a/docs/mkdocs.zh.yml +++ b/docs/mkdocs.zh.yml @@ -33,8 +33,6 @@ nav: - 连接RDS: user-guide/data-catalog-create-rds.md - 连接Glue: user-guide/data-catalog-create-glue.md - 连接JDBC: user-guide/data-catalog-create-jdbc.md - - 连接JDBC(Redshift): user-guide/data-catalog-create-jdbc-redshift.md - - 连接JDBC(RDS Proxy): user-guide/data-catalog-create-jdbc-database-proxy.md - 第2步:定义数据分类模板: user-guide/data-identifiers.md - 第3步:运行敏感数据发现任务: - 创建作业: user-guide/discovery-job-create.md @@ -51,7 +49,7 @@ nav: - 附录:CloudFormation堆栈的权限: user-guide/appendix-permissions.md - 附录:通过AWS Organization添加帐户: user-guide/appendix-organization.md - 附录:EU个人信息标识符(GDPR参考): user-guide/appendix-build-in-identifiers-eu-gdpr.md - - 附录:创建数据库代理: user-guide/appendix-database-proxy.md + - 附录:创建、配置数据库代理: user-guide/appendix-database-proxy.md - 常见问题: faq.md - 故障排查: troubleshooting.md - 卸载解决方案: uninstall.md diff --git a/docs/zh/images/batch-create-datasource-failed.png b/docs/zh/images/batch-create-datasource-failed.png new file mode 100644 index 00000000..e2137497 Binary files /dev/null and b/docs/zh/images/batch-create-datasource-failed.png differ diff --git a/docs/zh/images/batch-create-datasource-succeeded.png b/docs/zh/images/batch-create-datasource-succeeded.png new file mode 100644 index 00000000..6ca5de16 Binary files /dev/null and b/docs/zh/images/batch-create-datasource-succeeded.png differ diff --git a/docs/zh/images/batch_create_datasource_enablemarcos.png b/docs/zh/images/batch_create_datasource_enablemarcos.png new file mode 100644 index 00000000..38fcc31c Binary files /dev/null and b/docs/zh/images/batch_create_datasource_enablemarcos.png differ diff --git a/docs/zh/user-guide/appendix-database-proxy.md b/docs/zh/user-guide/appendix-database-proxy.md index 95e1d5d6..51fbc5b5 100644 --- a/docs/zh/user-guide/appendix-database-proxy.md +++ b/docs/zh/user-guide/appendix-database-proxy.md @@ -1,5 +1,5 @@ -## 使用EC2配置数据库代理 -g +## 创建EC2数据库代理,配置Nginx转发到数据源 + ### 创建并登录到代理EC2机器,配置转发端口 有一些用户的数据库Security Group设置了限制,只允许固定IP访问。这个时候,用户需要一个EC2作为Proxy来提供固定的IP。 @@ -33,7 +33,8 @@ stream { } } ``` -!!! Info 数据库太多时,如何编辑配置文件? + +!!! Info "数据库太多时,如何编辑配置文件?" 
    如果您需要配置多个端口转发,可以使用SDP **批量创建数据源**功能,并通过模版来创建Nginx配置文件。见下面附录。
 
 ##### Step 5: 重新加载配置文件
@@ -51,12 +52,12 @@ stream {
 至此,您已经配置完代理服务器的配置,可以回到SDP UI上手动添加或者批量添加数据源了。
 
 ---
-### 附录:批量创建从代理服务器转发的数据源
+## 批量创建数据源时,生成Nginx配置文件
 
-##### Step 1: 下载模版
+### Step 1: 下载模版
 从SDP UI上面,下载批量创建数据源的模版。
 
-##### Step 2: 编辑excel文件
+### Step 2: 编辑excel文件
 填入您所需要扫描的数据源。
 
 | InstanceName | SSL | Description | JDBC_URL | JDBC_Databases | SecretARN | Username | Password | AccountID | Region | ProviderID |
 |---------------------|-----|--------------------------|-----------------------------------------------|----------------|-----------|----------|------------|---------------|----------------|------------|
 | test-instance-7001 | 1 | xxxx1.sql.db.com:23297 | jdbc:mysql://172.31.48.6:7001 | | | root | Temp123456! | 123456789 | ap-guangzhou-1 | 4 |
 | test-instance-7002 | 1 | xxxx2.sql.db.com:3306 | jdbc:mysql://172.31.48.6:7002 | | | root | Temp123456! | 123456789 | ap-guangzhou-1 | 4 |
 
-
-##### Step 3: 生成Nginx软件的config文件
-(在本地)打开excel软件,菜单栏点击 Tools → Marco → Visual Basic Editor 功能。
+### Step 3: 生成Nginx软件的config文件
+(在本地)用Excel打开模版文件(.xlsm),在菜单栏点击 Tools → Macro → Visual Basic Editor。
 
 点击运行按钮,会看到excel文件所在目录下生成一个config.txt文件。
diff --git a/docs/zh/user-guide/data-catalog-create-jdbc-database-proxy.md b/docs/zh/user-guide/data-catalog-create-jdbc-database-proxy.md
deleted file mode 100644
index 86475d9e..00000000
--- a/docs/zh/user-guide/data-catalog-create-jdbc-database-proxy.md
+++ /dev/null
@@ -1,35 +0,0 @@
-# 连接到数据源 - JDBC(数据库代理)
-当您的RDS/数据库在私有网络,且对于IP有严格的限制(只允许固定IP进行接入),您需要通过这种方式进行数据源连接。
-
-### 前提条件 - 保持网络连通性
-1. 请确认您[添加AWS账户](data-source.md)时,选择JDBC方式,请转至[连接到数据源 - RDS](data-catalog-create-jdbc-database-proxy.md)进行操作。
-2. 创建数据库代理(Proxy):在方案所在VPC创建EC2作为代理机器,参考步骤详见:[附录:创建数据库代理](appendix-database-proxy.md)。
-3. 添加RDS访问白名单:将EC2的IP添加至待检测数据库的Security Group的Inbound Rule。
-
-## 通过EC2数据库代理(DB Proxy)连接数据库
-1. 在左侧菜单,选择 **连接数据源**
-2. 选择您所需要扫描的云账户,单击进入帐户,打开详细页面
-3. 单击进入一个云帐户,打开详细页面
-4. 选择 **自定义数据库(JDBC)** 标签页
-5. 点击**操作**,**添加数据源**
-
-   | 参数 | 必填项 | 参数描述 |
-   |-------------------|--------|--------------------------------------------------------------------------------------------------------------------|
-   | 实例名称 | 是 | 数据库实例名称 |
-   | 勾选SSL连接 | 否 | 是否通过SSL连接 |
-   | 描述(选填) | 否 | 实例描述 |
-   | JDBC URL(必填) | 是 | 至少填写一个数据库实例下的database,用于连接和扫描。具体格式:`jdbc:mysql://ec2_public_ip:port/databasename` |
-   | JDBC数据库 | 否 | 填写此实例(instance)中所有需要扫描的数据库(databases)列表(包含上面必填的database)。点击按钮,“自动查询数据库列表” |
-   | 凭证 | 是 | 选择用户名密码或SecretManager。填写数据库的用户名/密码。 |
-   | VPC | 是 | 选择Proxy所在的VPC |
-   | 子网 | 是 | 选择Proxy所在的VPC子网 |
-   | 安全组 | 是 | 选择Proxy所在的VPC安全组 |
-   !!! Info "自动获取数据库按钮"
-       方案目前支持自动获取MySQL数据库。
-
-6. 选择数据库实例,点击按钮 **同步至数据目录**
-7. 您看到目录状态变为灰色`PENDING`,表示连接开始(约3分钟)
-8. 您看到目录状态变为蓝色`CRAWLING`。(200张表约15分钟)
-9. 您看到目录状态边绿色 `ACTIVE`,则表示已为 RDS 实例创建了数据目录。此时您可以点击对应 数据目录 的连接进行初步查看,以及后续扫描工作。
-
-至此,您已经通过JDBC方式建立好的RDS代理的数据源连接,可以开始下一步操作了。
\ No newline at end of file
diff --git a/docs/zh/user-guide/data-catalog-create-jdbc-redshift.md b/docs/zh/user-guide/data-catalog-create-jdbc-redshift.md
deleted file mode 100644
index 623dc74b..00000000
--- a/docs/zh/user-guide/data-catalog-create-jdbc-redshift.md
+++ /dev/null
@@ -1,37 +0,0 @@
-# 连接到数据源 - JDBC(Redshift)
-
-当您希望对某个Redshift Cluster进行敏感数据扫描时,您可以将Redshift的database作为数据源。
-
-### 前提条件 - 保持网络连通性
-1. 请确认您[添加AWS账户](data-source.md)时,选择的是CloudFormation方式。如果您添加账户时,选择JDBC方式,请转至[通过EC2代理连接数据库](data-catalog-create-jdbc-database-proxy.md)进行操作。
-2. 准备好Redshift的连接凭证(用户名/密码)
-
-!!! Info "如何获得Redshift凭证"
-    DBA或业务方创建一个只读的用户(User)做安全审计使用。授予此用户只读权限:`GRANT SHOW VIEW, SELECT ON *.* TO 'reader'@'%'`;
-
-## 连接Amazon Redshift数据源
-1. 从左侧菜单,选择 **连接数据源**
-2. 选择**AWS Cloud**标签页
-3. 单击进入一个AWS帐户,打开详细页面
-4. 选择 **自定义数据库(JDBC)** 标签页。
-5. 
点击**操作**,**添加数据源**
-6. 在弹出窗口中,输入Redshift凭证信息。(如果您选择Secret Manager方式,需要提前为此Redshift的用户名/密码托管在Secret Manager。)
-
-   | 参数 | 必填项 | 参数描述 |
-   |-------------------|--------|--------------------------------------------------------------------------------------------------------------------|
-   | 实例名称 | 是 | Cluster中database名称 |
-   | 勾选SSL连接 | 否 | 是否通过SSL连接 |
-   | 描述(选填) | 否 | 描述 |
-   | JDBC URL(必填) | 是 | 填写一个Redshift的database,用于连接和扫描。具体格式:`jdbc:redshift://url:port/databasename` 。例如:`jdbc:redshift://sdp-uat-redshift.xxxxxxxxxx.us-east-1.redshift.amazonaws.com.cn:5439/dev`|
-   | JDBC数据库 | 否 | 保持为空 |
-   | 凭证 | 是 | 选择用户名密码或SecretManager。填写数据库的用户名/密码。参数获取途径:DBA或业务方为安全团队创建一个只读的User。此用户只需要数据库 SELECT(只读权限) |
-   | VPC | 是 | 选择Redshift所在的VPC |
-   | 子网 | 是 | 选择Redshift所在的VPC子网 |
-   | 安全组 | 是 | 选择Redshift所在的VPC安全组 |
-
-7. 点击 **连接**。您可以等待10s关闭此窗口。
-8. 您看到目录状态变为灰色`PENDING`,表示连接开始(约3分钟)
-9. 您看到目录状态变为蓝色`CRAWLING`。(200张表约15分钟)
-10. 您看到目录状态边绿色 `ACTIVE`,则表示已为该Redshift Cluster创建了数据目录。
-
-至此,您已经连接好Redshift数据源了,可以开始下一步操作👉[定义分类分级模版](data-identifiers.md)。
\ No newline at end of file
diff --git a/docs/zh/user-guide/data-catalog-create-jdbc.md b/docs/zh/user-guide/data-catalog-create-jdbc.md
index 490d45ac..e278c42c 100644
--- a/docs/zh/user-guide/data-catalog-create-jdbc.md
+++ b/docs/zh/user-guide/data-catalog-create-jdbc.md
@@ -2,6 +2,10 @@
 
 当您希望对某一种数据库进行敏感数据扫描时,您可以将DB instance或databases作为数据源。
 
+首先,请确认您[添加AWS账户](data-source.md)时,选择的是CloudFormation方式。如果您添加账户时,选择JDBC方式,请转至本页下方“通过代理数据库连接到数据源”一节进行操作。
+
+当前支持的JDBC数据源:
+
 | 支持的数据库类型 |
 |-----------------------|
 | Amazon Redshift |
@@ -14,43 +18,80 @@
 | Amazon RDS for MariaDB|
 
 ### 前提条件 - 保持网络连通性
-1. 请确认您[添加AWS账户](data-source.md)时,选择的是CloudFormation方式。如果您添加账户时,选择JDBC方式,请转至[通过EC2代理连接数据库](data-catalog-create-jdbc-database-proxy.md)进行操作。
-2. 请确保待检测数据库的inbound rule上有所在安全组的自引用, 操作详见[官网文档](https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html)。
-3. 准备好Redshift的连接凭证(用户名/密码)
+
+1. 请确保待检测数据库的inbound rule上有所在安全组的自引用, 操作详见[官网文档](https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html)。
+2. 准备好数据库连接凭证(用户名/密码)
 
 !!! Info "如何获得JDBC凭证"
     DBA或业务方创建一个只读的用户(User)做安全审计使用。授予此用户只读权限:`GRANT SHOW VIEW, SELECT ON *.* TO 'reader'@'%'`;
 
-## 连接Amazon Redshift数据源
+## 连接单个JDBC数据源
 1. 从左侧菜单,选择 **连接数据源**
 2. 选择**AWS Cloud**标签页
 3. 单击进入一个AWS帐户,打开详细页面
 4. 选择 **自定义数据库(JDBC)** 标签页。
 5. 点击**操作**,**添加数据源**
-6. 在弹出窗口中,输入Redshift凭证信息。(如果您选择Secret Manager方式,需要提前为此Redshift的用户名/密码托管在Secret Manager。)
+6. 在弹出窗口中,输入数据库凭证信息。(如果您选择Secret Manager方式,需要提前将用户名/密码托管在Secret Manager。)
 
    | 参数 | 必填项 | 参数描述 |
   |-------------------|--------|--------------------------------------------------------------------------------------------------------------------|
   | 实例名称 | 是 | 数据库名称 |
   | 勾选SSL连接 | 否 | 是否通过SSL连接 |
   | 描述(选填) | 否 | 实例描述 |
+  | 数据库类型 | 是 | 选择MySQL还是其他。如果是MySQL,方案支持自动查询实例中的数据库;若其他,您需要手动添加DB列表。 |
-  | JDBC URL(必填) | 是 | 填写一个database,用于连接和扫描。具体格式请参见下表。|
+  | JDBC URL(必填) | 是 | 填写一个database,用于连接和扫描。具体格式请参见本文最下面“JDBC URL格式以及样例”。|
   | JDBC数据库 | 否 | 如果您希望在一个数据目录展示多个数据库,则填写数据库列表。例如,1个数据目录为1个数据库实例,您可以填写instance下多个数据库。如果您只希望扫描此instance下一个数据库,则保留为空。 |
   | 凭证 | 是 | 选择用户名密码或SecretManager。填写数据库的用户名/密码。 |
   | VPC | 是 | 选择数据库所在的VPC |
   | 子网 | 是 | 选择数据库所在的VPC子网 |
   | 安全组 | 是 | 选择数据库所在的VPC安全组 |
 
-7. 点击 **连接**。您可以等待10s关闭此窗口。
-8. 您看到目录状态变为灰色`PENDING`,表示连接开始(约3分钟)
-9. 您看到目录状态变为蓝色`CRAWLING`。(200张表约15分钟)
-10. 您看到目录状态边绿色 `ACTIVE`,则表示已为该Redshift Cluster创建了数据目录。
+7. 点击 **授权**。您可以等待约10秒后关闭此窗口。
+8. 
您看到目录状态变为蓝色`AUTHORIZED`(已授权)。这也意味着在SDP后台,AWS Glue已经成功创建Crawler。
+
+**至此,您已经以JDBC方式连接到此数据源了🎉。您可以开始下一步操作👉[定义分类分级模版](data-identifiers.md)。**
+
+当您配置完分类模版,并运行完敏感数据发现任务:
+
+- 若任务成功:您会在此数据源页面看到目录状态变为绿色`ACTIVE`,表示已为该数据源创建了数据目录。
+- 若任务失败:您会在此数据源页面看到目录状态变为灰色并显示错误信息,您可以将鼠标悬停在错误上查看具体信息。
+
+## 批量自动创建JDBC数据源
+
+如果您有很多数据源,在UI上逐个添加不太方便,此时,您可以使用批量创建功能。
+
+### Step 1: 下载模版
+在AWS账号管理页面,点击**批量创建**按钮。
+在批量操作页面,首先下载“批量创建数据源”模版(.xlsm)。
+
+### Step 2: 编辑模版文件
+用Microsoft Excel打开这个文件。当Excel提示是否启用宏(Enable Macros)时,请选择启用。
+![edit-icon](docs/../../images/batch_create_datasource_enablemarcos.png)
+
+填入您所需要扫描的数据源,建议少量多次(方便排查错误)。
+
+| InstanceName | SSL | Description | JDBC_URL | JDBC_Databases | SecretARN | Username | Password | AccountID | Region | ProviderID |
+|---------------------|-----|--------------------------|----------------------------------------------|----------------|-----------|----------|------------|----------------------|----------------|------------|
+| test-instance-7001 | 1 | xxxx1.sql.db.com:23297 | jdbc:mysql://172.31.48.6:7001 | | | root | Temp123456! | 123456789 | ap-guangzhou-1 | 1 |
+| test-instance-7002 | 1 | xxxx2.sql.db.com:3306 | jdbc:mysql://172.31.48.6:7002 | | | root | Temp123456! | 123456789 | ap-guangzhou-1 | 1 |
+
+
+## 通过代理数据库连接到数据源
+
+当您的RDS/数据库在私有网络,且对于IP有严格的限制(只允许固定IP进行接入),您需要通过这种方式进行数据源连接。
+
+1. 创建数据库代理(Proxy):在方案所在VPC创建EC2作为代理机器,参考步骤详见:[附录:创建、配置数据库代理](appendix-database-proxy.md)。
+2. 在Proxy上配置Nginx转发,参考步骤详见:[附录:创建、配置数据库代理](appendix-database-proxy.md)。
+3. 创建JDBC数据源时:
+    - Description字段建议填真实数据库地址。
+    - JDBC URL字段填写`jdbc:mysql://ec2_public_ip:port/databasename`。
+    - Provider字段的值填4。(批量创建模版中需要填写)
+
+---
+
+### 创建数据源的参数
+**JDBC URL格式以及样例**
 
 | JDBC URL                                        | Example                                                                                        |
 |-------------------------------------------------|----------------------------------------------------------------------------------------------|
@@ -63,3 +104,14 @@
 | Amazon RDS for MariaDB                          | `jdbc:mysql://xxx-cluster.cluster-xxx.aws-region.rds.amazonaws.com:3306/employee`              |
 | Snowflake (Standard Connection)                 | `jdbc:snowflake://account_name.snowflakecomputing.com/?user=user_name&db=sample&role=role_name&warehouse=warehouse_name` |
 | Snowflake (AWS PrivateLink Connection)          | `jdbc:snowflake://account_name.region.privatelink.snowflakecomputing.com/?user=user_name&db=sample&role=role_name&warehouse=warehouse_name` |
+
+
+Provider参数(批量创建时使用):
+
+| Provider | Provider Id | Description |
+|------------|-------------|-----------------------------------|
+| AWS | 1 | AWS(安装方式:CloudFormation)|
+| Tencent | 2 | 腾讯账号 |
+| Google | 3 | 谷歌账号 |
+| AWS (JDBC Only) | 4 | AWS(安装方式:JDBC Only)|
+
diff --git a/docs/zh/user-guide/discovery-job-create.md b/docs/zh/user-guide/discovery-job-create.md
index 13052858..da08b88c 100644
--- a/docs/zh/user-guide/discovery-job-create.md
+++ b/docs/zh/user-guide/discovery-job-create.md
@@ -2,33 +2,41 @@
 
 ## 创建发现作业
 
-1. 在左侧菜单,选择**执行敏感数据发现作业**。
-2. 选择**创建敏感数据发现作业**。
+在左侧菜单,选择**执行敏感数据发现作业**。
+
+点击按钮**创建敏感数据发现作业**。
 ![edit-icon](docs/../../images/job-list-cn.png)
 
-   - **步骤1:选择数据源**
+**步骤1**: 选择Provider和数据源
 
-   | Provider | Data source |
-   |----------------|--------------------|
-   | AWS | S3, RDS, Glue, JDBC |
-   | Tencent | JDBC |
-   | Google | JDBC |
+| Provider | Data source |
+|----------------|--------------------|
+| AWS | S3, RDS, Glue, Custom databases, Proxy databases |
+| Tencent | JDBC |
+| Google | JDBC |
+
+!!! Info "AWS的CustomDB和ProxyDB是指什么?" 
- 如果是本账号(方案所在账号)扫描,连接JDBC数据源,请选择Custom databases
+    - 如果您添加账号时,选择了CloudFormation的安装方式,连接了JDBC数据源,请选择Custom databases
+    - 如果您添加账号时,选择了JDBC Only的安装方式,连接了JDBC数据源,请选择Proxy databases
 
-   - **步骤2:作业设置**
 
-   | 作业设置 | 描述 | 选项 |
-   | --- | --- | --- |
-   | 扫描频率 | 指发现作业的扫描频率。 | 按需运行
每日
每周
每月 | - | 扫描深度 | 指抽样行数。 | 100(推荐)
10, 30, 60, 100, 300, 500, 1000 | - | 扫描深度 - 非结构化数据 | 仅适用于S3,不同文件夹下,抽样非结构化文件数量 | 可跳过, 10文件, 30文件, 所有文件 | - | 扫描范围 | 定义目标数据源的整体扫描范围。
“全面扫描”表示扫描所有目标数据源。
“增量扫描”表示跳过自上次数据目录更新以来未更改的数据源。 | 全面扫描
增量扫描(推荐) | - | 检测阈值 | 定义作业所需的容忍度水平。如果扫描深度为 1000 行,则 10% 的阈值意味着如果超过 100 行(共 1000 行)匹配标识符规则,则该列将被标记为敏感。较低的阈值表示该作业对敏感数据的容忍度较低。 | 10%(推荐)
20%
30%
40%
50%
100% | - | 覆盖手动更新的隐私标签 | 选择是否允许该作业使用作业结果覆盖数据目录隐私标签。 | 不覆盖(推荐)
覆盖 | +| 作业设置 | 描述 | 选项 | +| --- | --- | --- | +| 扫描频率 | 指发现作业的扫描频率。 | 按需运行
每日
每周
每月 | +| 扫描深度 | 指抽样行数。 | 100(推荐)
10, 30, 60, 100, 300, 500, 1000 | +| 扫描深度 - 非结构化数据 | 仅适用于S3,不同文件夹下,抽样非结构化文件数量 | 可跳过, 10文件, 30文件, 所有文件 | +| 扫描范围 | 定义目标数据源的整体扫描范围。
“全面扫描”表示扫描所有目标数据源。
“增量扫描”表示跳过自上次数据目录更新以来未更改的数据源。 | 全面扫描
增量扫描(推荐) | +| 检测阈值 | 定义作业所需的容忍度水平。如果扫描深度为 1000 行,则 10% 的阈值意味着如果超过 100 行(共 1000 行)匹配标识符规则,则该列将被标记为敏感。较低的阈值表示该作业对敏感数据的容忍度较低。 | 10%(推荐)
20%
30%
40%
50%
100% | +| 覆盖手动更新的隐私标签 | 选择是否允许该作业使用作业结果覆盖数据目录隐私标签。 | 不覆盖(推荐)
覆盖 | - - **步骤3:高级配置项** - - **步骤4:作业预览** +**步骤4**:高级配置项 -1. 预览作业后,选择**运行作业**。 +**步骤5**:作业预览 + 预览作业后,选择**运行作业**。 ---
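+
+由于发现作业底层由AWS Glue作业执行,您也可以在任务运行时查看底层Glue作业的运行状态。下面是一个基于boto3的最小示例(非方案自带功能;其中的region以及按作业名过滤的思路均为假设,实际的Glue作业名取决于您的部署):
+
+```python
+import boto3
+
+# 假设性的监控脚本:列出账号内每个Glue作业及其最近一次运行的状态。
+# 请根据您的部署调整region_name,并按实际作业名添加过滤条件。
+glue = boto3.client("glue", region_name="us-east-1")
+
+paginator = glue.get_paginator("get_jobs")
+for page in paginator.paginate():
+    for job in page["Jobs"]:
+        runs = glue.get_job_runs(JobName=job["Name"], MaxResults=1)["JobRuns"]
+        if runs:
+            print(job["Name"], runs[0]["JobRunState"], runs[0].get("StartedOn"))
+```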