Tuning the Generator

Gabor Szarnyas edited this page May 30, 2021 · 1 revision

Datagen supports tuning some parts of the data generation process. This allows the user to change the way the degree distribution of the Person friendship subgraph is generated, the way the edges between the Persons are created and how data is serialized.

Knows Degree Distribution

Datagen defines an interface for implementing custom friendship degree distributions for Persons: ldbc.snb.datagen.generator.distribution.DegreeDistribution

This interface defines three methods:

  • initialize(Configuration conf): This is called once an instance implementing the interface is created at the beginning of the person generation process. The parameter conf is used to pass custom configuration parameters by means of the params.ini file, in the same way it is done for other parameters of Datagen.
  • reset(long seed): This method is called every time Datagen needs to put the generator into a known state. After a call to reset with a given seed, the same number of calls to nextDegree() must always produce exactly the same sequence of degrees. This guarantees determinism within Datagen.
  • nextDegree(): This method is called each time we want to get a new degree for a Person.
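
The three methods above can be sketched with a small custom distribution. This is a minimal illustration, not Datagen code: the interface below is a stand-in declared locally so the example compiles on its own, a plain Map replaces the Configuration parameter, the long return type of nextDegree() is an assumption, and the UniformDegreeDistribution class and its maxDegree option are hypothetical.

```java
import java.util.Map;
import java.util.Random;

class UniformDegreeExample {

    // Stand-in for the DegreeDistribution interface described above.
    // Assumption: a Map replaces Configuration, and nextDegree() returns long.
    interface DegreeDistribution {
        void initialize(Map<String, String> conf);
        void reset(long seed);
        long nextDegree();
    }

    // Hypothetical implementation: degrees drawn uniformly from [1, maxDegree].
    static class UniformDegreeDistribution implements DegreeDistribution {
        // Hypothetical option name, not a real Datagen parameter.
        static final String MAX_DEGREE_KEY = "example.UniformDegreeDistribution.maxDegree";

        private int maxDegree = 100;
        private Random random = new Random(0);

        @Override
        public void initialize(Map<String, String> conf) {
            // Read the custom parameter, falling back to the default.
            maxDegree = Integer.parseInt(conf.getOrDefault(MAX_DEGREE_KEY, "100"));
        }

        @Override
        public void reset(long seed) {
            // Recreating the generator discards all previous state, so the
            // sequence after reset depends only on the seed.
            random = new Random(seed);
        }

        @Override
        public long nextDegree() {
            return 1 + random.nextInt(maxDegree);
        }
    }

    public static void main(String[] args) {
        UniformDegreeDistribution dist = new UniformDegreeDistribution();
        dist.initialize(Map.of(UniformDegreeDistribution.MAX_DEGREE_KEY, "50"));

        dist.reset(42L);
        long first = dist.nextDegree();
        long second = dist.nextDegree();

        dist.reset(42L);
        boolean same = first == dist.nextDegree() && second == dist.nextDegree();
        System.out.println(same); // prints true
    }
}
```

Keeping all random state inside a single generator that reset replaces is the simplest way to satisfy the determinism requirement.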

To tell Datagen to use a particular DegreeDistribution implementation, add the following line to your params.ini file:

ldbc.snb.datagen.generator.distribution.degreeDistribution:<fully qualified class name of the implementation>

Datagen already includes several different degree distributions with the following subclass relations:

  • ldbc.snb.datagen.generator.distribution.DegreeDistribution
    • ldbc.snb.datagen.generator.distribution.BucketedDistribution
      • ldbc.snb.datagen.generator.distribution.EmpiricalDistribution
      • ldbc.snb.datagen.generator.distribution.FacebookDegreeDistribution
    • ldbc.snb.datagen.generator.distribution.CumulativeBasedDegreeDistribution
      • ldbc.snb.datagen.generator.distribution.AltmannDistribution
      • ldbc.snb.datagen.generator.distribution.DiscreteWeibullDistribution
    • ldbc.snb.datagen.generator.distribution.GeoDistribution
    • ldbc.snb.datagen.generator.distribution.MoeZipfDistribution
    • ldbc.snb.datagen.generator.distribution.ZipfDistribution

The default distribution generator is ldbc.snb.datagen.generator.distribution.FacebookDegreeDistribution

FacebookDegreeDistribution

This implements a degree distribution that models the one observed in Facebook.

AltmannDistribution

This implements the Altmann Distribution, which accepts the following parameters:

| Option | Default | Description |
| --- | --- | --- |
| ldbc.snb.datagen.generator.distribution.AltmannDistribution.alpha | 0.4577 | The value of the parameter alpha of the Altmann Distribution |
| ldbc.snb.datagen.generator.distribution.AltmannDistribution.beta | 0.0162 | The value of the parameter beta of the Altmann Distribution |

DiscreteWeibullDistribution

This implements the Discrete Weibull distribution, which accepts the following parameters:

| Option | Default | Description |
| --- | --- | --- |
| ldbc.snb.datagen.generator.distribution.DiscreteWeibullDistribution.alpha | 0.8505 | The value of the parameter beta of the Discrete Weibull Distribution |
| ldbc.snb.datagen.generator.distribution.DiscreteWeibullDistribution.p | 0.0205 | The value of the parameter p of the Discrete Weibull Distribution |

GeoDistribution

This implements the Geometric distribution, which accepts the following parameters:

| Option | Default | Description |
| --- | --- | --- |
| ldbc.snb.datagen.generator.distribution.GeoDistribution.alpha | 0.12 | The value of the parameter alpha of the Geometric Distribution |

ZipfDistribution

This implements the Zipf distribution, which accepts the following parameters:

| Option | Default | Description |
| --- | --- | --- |
| ldbc.snb.datagen.generator.distribution.ZipfDistribution.alpha | 1.7 | The value of the parameter alpha of the Zipf Distribution |
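
To give a feel for what alpha controls, the following sketch samples degrees with probability proportional to k^(-alpha) by inverting a precomputed cumulative table. It is an illustration only and does not reproduce Datagen's ZipfDistribution class: the ZipfDegreeSketch name, the truncation at maxDegree, and the table-based sampling are all assumptions made for the example.

```java
import java.util.Random;

class ZipfDegreeSketch {
    // cdf[k - 1] holds P(degree <= k) for k = 1..maxDegree
    private final double[] cdf;

    // Assumption: the distribution is truncated at maxDegree so the table is finite.
    ZipfDegreeSketch(double alpha, int maxDegree) {
        cdf = new double[maxDegree];
        double norm = 0.0;
        for (int k = 1; k <= maxDegree; k++) {
            norm += Math.pow(k, -alpha);
        }
        double acc = 0.0;
        for (int k = 1; k <= maxDegree; k++) {
            acc += Math.pow(k, -alpha) / norm;
            cdf[k - 1] = acc;
        }
        cdf[maxDegree - 1] = 1.0; // guard against accumulated rounding error
    }

    // Probability of drawing exactly degree k.
    double probabilityOf(int k) {
        return k == 1 ? cdf[0] : cdf[k - 1] - cdf[k - 2];
    }

    // Inverse transform sampling: smallest k with P(degree <= k) >= u.
    int sample(Random random) {
        double u = random.nextDouble();
        int lo = 0;
        int hi = cdf.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo + 1;
    }

    public static void main(String[] args) {
        ZipfDegreeSketch zipf = new ZipfDegreeSketch(1.7, 1000);
        Random random = new Random(1L);
        for (int i = 0; i < 5; i++) {
            System.out.println(zipf.sample(random));
        }
    }
}
```

A larger alpha concentrates more probability on small degrees; in this sketch, degree 1 is 2^alpha times as likely as degree 2.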

MoeZipfDistribution

This implements the MoeZipf distribution, which accepts the following parameters:

| Option | Default | Description |
| --- | --- | --- |
| ldbc.snb.datagen.generator.distribution.MoeZipfDistribution.alpha | 1.7 | The value of the parameter alpha of the MoeZipf Distribution |
| ldbc.snb.datagen.generator.distribution.MoeZipfDistribution.delta | 1.5 | The value of the parameter delta of the MoeZipf Distribution |

Edge Generation

As with the friendship degree distribution, Datagen defines an interface that can be implemented to change the way knows edges are created: ldbc.snb.datagen.generator.KnowsGenerator

This interface defines the following methods:

  • initialize(Configuration conf): This is called once an instance implementing the interface is created at the beginning of the edge generation process. The parameter conf is used to pass custom configuration parameters by means of the params.ini file, in the same way it is done for other parameters of Datagen.
  • generateKnows(ArrayList<Person> persons, int seed, ArrayList<Float> percentages, int step_index): This is called once the edge generation process starts, to generate the edges for a given block of persons. The first parameter is the array of persons to generate the edges for. The second parameter is a seed for any random number generator used by the implementation. The implementation must be deterministic: two identical sequences of operations with identical seeds must produce the same result. The percentages parameter is an array containing the percentage of edges that must be created for each person, out of the maximum number of desired edges. Finally, step_index identifies the current edge generation step and is used to index the percentages array.
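
The contract above can be sketched as follows. This is a minimal illustration under stated assumptions, not Datagen code: Person and KnowsGenerator are local stand-ins, a Map replaces Configuration, each entry of percentages is assumed to be a fraction in [0, 1], and the UniformKnowsGenerator class with its attemptsPerEdge option is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

class KnowsGeneratorExample {

    // Stand-in for Datagen's Person class, reduced to what this sketch needs.
    static class Person {
        final long id;
        final int maxKnows;                      // maximum number of desired edges
        final Set<Long> knows = new HashSet<>(); // ids of persons already connected

        Person(long id, int maxKnows) {
            this.id = id;
            this.maxKnows = maxKnows;
        }
    }

    // Stand-in for the KnowsGenerator interface; a Map replaces Configuration.
    interface KnowsGenerator {
        void initialize(Map<String, String> conf);

        void generateKnows(ArrayList<Person> persons, int seed,
                           ArrayList<Float> percentages, int step_index);
    }

    // Hypothetical generator: picks partners uniformly at random within the block.
    static class UniformKnowsGenerator implements KnowsGenerator {
        private int attemptsPerEdge = 10;

        @Override
        public void initialize(Map<String, String> conf) {
            // Hypothetical option name, not a real Datagen parameter.
            attemptsPerEdge = Integer.parseInt(
                    conf.getOrDefault("example.UniformKnowsGenerator.attemptsPerEdge", "10"));
        }

        @Override
        public void generateKnows(ArrayList<Person> persons, int seed,
                                  ArrayList<Float> percentages, int step_index) {
            Random random = new Random(seed);          // all randomness derives from the seed
            float share = percentages.get(step_index); // assumed to be a fraction in [0, 1]
            for (Person p : persons) {
                int budget = (int) (p.maxKnows * share); // edges p may create in this step
                int created = 0;
                int maxAttempts = attemptsPerEdge * budget;
                for (int attempt = 0; attempt < maxAttempts && created < budget; attempt++) {
                    Person q = persons.get(random.nextInt(persons.size()));
                    boolean hasRoom = p.knows.size() < p.maxKnows
                            && q.knows.size() < q.maxKnows;
                    if (q != p && hasRoom && p.knows.add(q.id)) {
                        q.knows.add(p.id); // record the edge on both endpoints
                        created++;
                    }
                }
            }
        }
    }

    // Builds a block of persons with ids 0..size-1 and the same desired degree.
    static ArrayList<Person> block(int size, int maxKnows) {
        ArrayList<Person> persons = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            persons.add(new Person(i, maxKnows));
        }
        return persons;
    }

    public static void main(String[] args) {
        ArrayList<Person> persons = block(10, 4);
        ArrayList<Float> percentages = new ArrayList<>(List.of(0.5f, 0.5f));
        KnowsGenerator generator = new UniformKnowsGenerator();
        generator.initialize(Map.of());
        generator.generateKnows(persons, 7, percentages, 0);
        for (Person p : persons) {
            System.out.println(p.id + " knows " + p.knows);
        }
    }
}
```

Creating the random number generator inside generateKnows from the seed, rather than keeping it as a field, means repeated calls with the same seed and the same input block give the same edges.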

To tell Datagen to use a particular KnowsGenerator implementation, add the following line to your params.ini file:

ldbc.snb.datagen.generator.knowsGenerator:<fully qualified class name of the implementation>

Available knows generator implementations are:

  • ldbc.snb.datagen.generator.RandomKnowsGenerator
  • ldbc.snb.datagen.generator.DistanceKnowsGenerator
  • ldbc.snb.datagen.generator.ClusteringKnowsGenerator

The default generator is ldbc.snb.datagen.generator.DistanceKnowsGenerator

RandomKnowsGenerator

This generator creates edges between the Persons in the block at random, trying to respect their assigned degrees, using the configuration model graph generator.
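
For readers unfamiliar with the configuration model, the idea can be sketched as follows: every person contributes one "stub" per desired edge, the stubs are shuffled, and consecutive stubs are paired into edges. This is a generic textbook-style sketch, not the RandomKnowsGenerator source; the class name is hypothetical, and dropping self-loops and duplicate edges is one common simplification, which is why the requested degrees are only approximated.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

class ConfigurationModelSketch {

    // Returns undirected edges as [u, v] pairs with u < v.
    static Set<List<Integer>> generate(int[] degrees, long seed) {
        // One stub per desired edge endpoint.
        List<Integer> stubs = new ArrayList<>();
        for (int v = 0; v < degrees.length; v++) {
            for (int d = 0; d < degrees[v]; d++) {
                stubs.add(v);
            }
        }
        // A seeded shuffle keeps the result reproducible.
        Collections.shuffle(stubs, new Random(seed));

        Set<List<Integer>> edges = new HashSet<>();
        for (int i = 0; i + 1 < stubs.size(); i += 2) {
            int u = stubs.get(i);
            int v = stubs.get(i + 1);
            if (u == v) {
                continue; // drop self-loops
            }
            // The set silently drops duplicate edges.
            edges.add(List.of(Math.min(u, v), Math.max(u, v)));
        }
        return edges;
    }

    public static void main(String[] args) {
        int[] degrees = {3, 2, 2, 1, 2};
        for (List<Integer> edge : generate(degrees, 11L)) {
            System.out.println(edge.get(0) + " - " + edge.get(1));
        }
    }
}
```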

DistanceKnowsGenerator

This is the original LDBC Datagen edge generation process. It creates edges between persons in the block with a probability based on their distance in the block.

ClusteringKnowsGenerator

This creates edges with the goal of obtaining a target clustering coefficient, based on having a community structure. This generator accepts the following parameter:

| Option | Default | Description |
| --- | --- | --- |
| ldbc.snb.datagen.generator.ClusteringKnowsGenerator.clusteringCoefficient | 0.1 | The value of desired clustering coefficient |
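
When checking a generated graph against the target, it helps to be able to measure the clustering coefficient. The sketch below computes the average local clustering coefficient of an undirected graph. Whether the generator targets this average or the global (transitivity) definition is not stated here, so treat the choice as an assumption; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ClusteringCoefficientSketch {

    // Builds adjacency sets from a list of undirected edges.
    static Map<Integer, Set<Integer>> graph(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    // Average over all vertices of (links among neighbours) / (possible links).
    // Vertices with fewer than two neighbours contribute 0.
    static double averageLocalClustering(Map<Integer, Set<Integer>> adj) {
        if (adj.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (Set<Integer> neighbours : adj.values()) {
            List<Integer> nb = new ArrayList<>(neighbours);
            int k = nb.size();
            if (k < 2) {
                continue;
            }
            int links = 0;
            for (int i = 0; i < k; i++) {
                for (int j = i + 1; j < k; j++) {
                    if (adj.get(nb.get(i)).contains(nb.get(j))) {
                        links++;
                    }
                }
            }
            sum += 2.0 * links / (k * (k - 1.0));
        }
        return sum / adj.size();
    }

    public static void main(String[] args) {
        int[][] triangle = {{0, 1}, {1, 2}, {0, 2}};
        System.out.println(averageLocalClustering(graph(triangle))); // prints 1.0
    }
}
```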

Date and DateTime formatting

You can customize the way Dates and DateTimes are formatted in Datagen. By implementing the DateFormatter interface, you control the actual format of the timestamps. To use your implementation, point the following option in your params.ini file to it, as in this example:

ldbc.snb.datagen.serializer.dateFormatter:ldbc.snb.datagen.serializer.formatter.LongDateFormatter

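We provide two default formatters:

  • ldbc.snb.datagen.serializer.formatter.LongDateFormatter
  • ldbc.snb.datagen.serializer.formatter.StringDateFormatter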

For the StringDateFormatter, the string format can be customized with a standard Java date-time pattern. For example:

ldbc.snb.datagen.serializer.formatter.StringDateFormatter.dateTimeFormat:"yyyy-MM-dd HH:mm:ss.SSS"
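
To preview what a pattern such as the one above produces, it can be tried in plain Java. The sketch below uses java.time and formats in UTC; both are assumptions made for the example, since the formatter's internals are not described here. The pattern letters used have the same meaning in java.text.SimpleDateFormat. The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class DateTimeFormatSketch {

    // Renders a timestamp given in epoch milliseconds with a Java pattern.
    // Assumption: timestamps are rendered in UTC.
    static String asString(long epochMillis, String pattern) {
        DateTimeFormatter formatter =
                DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        return formatter.format(Instant.ofEpochMilli(epochMillis));
    }

    // The numeric alternative: the raw epoch milliseconds as text.
    static String asLong(long epochMillis) {
        return Long.toString(epochMillis);
    }

    public static void main(String[] args) {
        long timestamp = 1262304000123L; // 2010-01-01T00:00:00.123Z
        System.out.println(asString(timestamp, "yyyy-MM-dd HH:mm:ss.SSS"));
        // prints 2010-01-01 00:00:00.123
        System.out.println(asLong(timestamp));
        // prints 1262304000123
    }
}
```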

Custom weight computation

The computation of the weights on the edges of the person-knows-person subgraph can be customized by implementing the Person.PersonSimilarity interface. You can specify your implementation in your params.ini file, as in the following example:

ldbc.snb.datagen.generator.person.similarity:ldbc.snb.datagen.objects.similarity.GeoDistanceSimilarity

Currently provided plugins are GeoDistanceSimilarity (default), which computes the weight based on how close the persons are geographically, and InterestsSimilarity, where the weight is based on the common interests of both persons.
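
A custom similarity could look like the following sketch. It is illustrative only: Person and PersonSimilarity are local stand-ins, the single method similarity(Person, Person) returning a float is an assumed signature, and the Jaccard index used here is just one possible way to turn common interests into a weight, not necessarily what InterestsSimilarity does.

```java
import java.util.HashSet;
import java.util.Set;

class SimilarityExample {

    // Stand-in for Datagen's Person, reduced to a set of interest ids.
    static class Person {
        final Set<Integer> interests;

        Person(Set<Integer> interests) {
            this.interests = interests;
        }
    }

    // Stand-in for Person.PersonSimilarity; the method signature is an assumption.
    interface PersonSimilarity {
        float similarity(Person a, Person b);
    }

    // Hypothetical plugin: Jaccard index of the two interest sets.
    static class JaccardInterestsSimilarity implements PersonSimilarity {
        @Override
        public float similarity(Person a, Person b) {
            Set<Integer> common = new HashSet<>(a.interests);
            common.retainAll(b.interests);
            int union = a.interests.size() + b.interests.size() - common.size();
            // Two persons without interests get weight 0 rather than dividing by zero.
            return union == 0 ? 0.0f : (float) common.size() / union;
        }
    }

    public static void main(String[] args) {
        Person alice = new Person(Set.of(1, 2, 3));
        Person bob = new Person(Set.of(2, 3, 4));
        System.out.println(new JaccardInterestsSimilarity().similarity(alice, bob)); // prints 0.5
    }
}
```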