Skip to content

Latest commit

 

History

History
112 lines (85 loc) · 11.6 KB

README.md

File metadata and controls

112 lines (85 loc) · 11.6 KB

NuGet Insights data tables

The primary purpose of NuGet Insights is to generate CSV data files which can be easily imported into many different query systems. The implementation and some extra metadata are tailored for making a data import into Azure Data Explorer (a.k.a. Kusto) easier, but since CSVs are supported for import in so many technologies, the hope is that you can adapt the data files to your needs with relative ease.

If you are using NuGet Insights in Azure and want to use Azure Data Explorer, there is a built-in importer feature that utilizes its ingestion endpoint. If you're not using Azure Data Explorer, you can download the CSV files from Blob Storage and import them. Or, if your import flow supports URLs, you can generate SAS tokens for the URLs or make your storage container publicly accessible and import directly using Blob Storage blob URLs.

Each data table is represented by a set of CSV files. The data table is split (i.e. partitioned) across multiple files so that processing the data files in memory feasible. Also, many data import systems can operate in parallel across many files. Therefore, having more files that are smaller is often better than having fewer large files.

The CSV file format itself is loosely defined across the industry but NuGet.Insights adheres to RFC 4180. Notably, strings containing certain characters (e.g. line breaks, double quotes, commas) are enclosed in double quotes. All CSV files are UTF-8 encoded without a byte order mark (BOM).

Tables

The following tables are currently generated by NuGet Insights:

Table name Description
CatalogLeafItems All catalog leaf items per package
Certificates All certificates referenced in available package signatures
ExcludedPackages Whether a package ID is excluded from default search results
NuGetPackageExplorerFiles SourceLink and debug information for assemblies
NuGetPackageExplorers Reproducibility status for NuGet packages
PackageArchiveEntries Low level information about .nupkg ZIP file entries
PackageArchives Hashes and low level information about .nupkg ZIP archives
PackageAssemblies .NET assembly metadata such as assembly name, version, and strong name status
PackageAssets Package assets and target framework as interpreted by the NuGet restore
PackageCertificates Mapping table between packages and certificates (many-to-many relationship)
PackageCompatibilities Framework/platform compatibility defined by several different algorithms
PackageContents Subset of text content files in each package
PackageDeprecations Package deprecation data per package version, if any
PackageDownloads Current download counts for individual package versions
PackageFiles Basic information (such as hash) of every file in the package .nupkg
PackageIcons Metadata about embedded and icon-URL icons available on NuGet.org
PackageLicenses Metadata about package license files, expressions and URLs
PackageManifests Recognized values from .nuspec files
PackageOwners Current package owner usernames (user or organization)
PackageReadmes Metadata NuGet package READMEs (either embedded or legacy)
PackageSignatures Metadata about the NuGet package signature, e.g. author signing certificate
PackageVersions Components of the package SemVer 2.0.0 and latest version status
PackageVulnerabilities Package vulnerability data per package version, if any
PopularityTransfers Popularity transfers for a give package ID, if any
SymbolPackageArchiveEntries Low level information about .snupkg ZIP file entries
SymbolPackageArchives Hashes and low level information about .snupkg ZIP archives
SymbolPackageFiles Basic information (such as hash) of every file in the package .snupkg
VerifiedPackages Whether a package ID is verified or not

Terminology & concepts

In these documents, the following language is used. To avoid confusion, here's what is meant.

Record: One row in the CSV data file. May represent a view of a single package or something more granular.

Package ID: This is the string identity of a package shared by all of its versions, e.g. Newtonsoft.Json.

Package version: This is the SemVer string that refers to a specific version of a package ID, e.g. 9.0.1.

Package: In isolation, this refers to a specific version of a package. A package is identified by an ID and version.

Deleted packages: When a package is deleted from NuGet.org, it must be scrubbed shortly after in the NuGet Insights data sets. This means that historical metadata about deleted packages is not available. The only information that is retained about deleted packages is the package ID, version, and perhaps some enum value indicated that the package was deleted.

Cardinality: Most tables have data related to individual packages (i.e. package versions) but others are at the package ID level or have multiple records per package.

Database normalization: Traditional database normalization is mostly avoided for two reasons.

  1. Normalized (de-duped) data shared between packages makes data clean-up due to package deletes much more complicated.
  2. Query systems like Azure Data Explorer thrive on denormalized data and queries often run faster if you don't have to perform a join.

Table relationships: Sometimes have a child or parent relationship with other tables. This is a primitive form of denormalization and is mostly used to ease some query-time scenarios rather than reducing data size.

Data types

CSV itself doesn't prescribe data types for columns but documentation for each table will describe column data types in simple terms than can be easily mapped to a programming language or schema system of your choice. An empty string in a CSV cell indicates a null value for non-string cells.

Data type Description
string Simple string of characters. Depending on the column, this can potentially contain line break characters
bool A boolean, represented as the string true or false
short Signed 16-bit integer, represented in decimal form
ushort Unsigned 16-bit integer, represented in decimal form
uint Unsigned 32-bit integer, represented in decimal form
int Signed 32-bit integer, represented in decimal form
long Signed 64-bit integer, represented in decimal form
float A floating point number stored in 4 bytes
double A floating point number stored in 8 bytes
timestamp A date and time, represented as an ISO 8601 string. Always in UTC time with the Z suffix
enum An enumerated value, represented as a string. All possible values are documented
flags enum One or more enumerated values, represented as a comma separated list of strings. All possible values are documented
object A JSON encoded object. The schema of the object is documented
array of objects A JSON encoded array of objects. The schema of the array is documented
array of strings A JSON encoded array of strings. The schema of the array is documented

Partitioning algorithm

If a table has more than one data file, the following algorithm is used to determine which specific CSV file a record should go in to. The primary implemention is written in C# (copied below for understanding). To calculate the bucket in Kusto (KQL), you can use the GetBucket user-defined function implemented in this repository.

The bucketCount is the maximum number of data files. This is a fixed number per table. The bucketKey is a string used for finding the specific CSV file (a.k.a. bucket) to put the record in to. On each table document, the Column used for partitioning property states which CSV column was used for partitioning.

Example: when the bucketCount is 1000 and the bucketKey is newtonsoft.json, the bucket is 892.

C# implementation

public static int GetBucket(int bucketCount, string bucketKey)
{
    int bucket;
    using (var algorithm = SHA256.Create())
    {
        var hash = algorithm.ComputeHash(Encoding.UTF8.GetBytes(bucketKey));
        bucket = (int)(BitConverter.ToUInt64(hash) % (ulong)bucketCount);
    }

    return bucket;
}