
Framework for setting up environment needed for ASV tests

grusev edited this page Feb 28, 2025 · 8 revisions

General

Writing ASV tests against a single type of storage, especially a non-shared one such as LMDB, is generally straightforward. All the moving parts, including the storage, are isolated on a specific developer machine or a GitHub runner. However, writing tests that need to use shared storage and can be executed concurrently in multiple locations can become challenging and lead to unpredictable results, as they will start to overwrite each other's data. Additionally, we aim to reuse the same tests with minimal changes (such as modifying a single parameter) to run against different storage types like LMDB, Amazon S3, GCP, and Azure.

Furthermore, if we want to set up similar environments for other types of tests, we cannot use our current approach for writing tests and setups that are tightly coupled with the ASV test classes.

We need a safer approach: an additional layer of abstraction that helps us achieve this safely and effectively.

What abstraction could solve the problem?

An abstraction that makes all storages appear as shared storages (including LMDB) and safeguards against potential issues, such as overwriting other tests' data.

To achieve this, we can use prefixes for Amazon S3 and paths for file locations in LMDB.

At the top level (root), there is:

  • one Bucket for Amazon S3
  • one root folder for LMDB storages

That will be the dedicated space for ASV tests.

From there, we can create several different spaces, dedicated to different needs.
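The layout above can be sketched as follows. This is a minimal illustration: the bucket name, root folder, and helper functions are hypothetical, and only the idea of one root per storage type comes from the text above.

```python
import os

# Hypothetical values for illustration only; the real bucket name and
# prefix constants are defined by the ASV test framework.
ASV_BUCKET = "arcticdb-asv-tests"      # one bucket for Amazon S3
ASV_LMDB_ROOT = "/data/arcticdb-asv"   # one root folder for LMDB storages

def s3_space(prefix: str) -> str:
    # Each dedicated space is a key prefix inside the single bucket.
    return f"s3://{ASV_BUCKET}/{prefix}"

def lmdb_space(prefix: str) -> str:
    # For LMDB the same idea maps to a subfolder of the root folder.
    return os.path.join(ASV_LMDB_ROOT, prefix)

print(s3_space("PERSISTENT_LIBS_PREFIX"))   # s3://arcticdb-asv-tests/PERSISTENT_LIBS_PREFIX
print(lmdb_space("PERSISTENT_LIBS_PREFIX")) # /data/arcticdb-asv/PERSISTENT_LIBS_PREFIX
```

The same prefix string thus designates a space regardless of whether the backend addresses it as object keys or as folders.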

Persistent space

This shared space is meant to be accessed by all machines and clients. It is used to store data that is written once and then read by all clients. Therefore, this space is designated with a common prefix (PERSISTENT_LIBS_PREFIX) in the bucket.

Modifiable space

This private space is meant for each client and test, isolated from others (all modifiable operations require such space). The prefix for this space is constructed from several parts:

  • First, the common name for this prefix (MODIFIABLE_LIBS_PREFIX).
  • Then, a unique part for each machine added via an environment variable (ARCTICDB_PERSISTENT_STORAGE_SHARED_PATH_PREFIX). This can be done for GitHub and individually by each person.
  • The third part should be defined in each test case.

By concatenating these three parts, we create a separate unique space in the storage for tests that require private control.
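A rough sketch of this composition follows. The helper name, the separator, and the fallback value are hypothetical; only MODIFIABLE_LIBS_PREFIX and the ARCTICDB_PERSISTENT_STORAGE_SHARED_PATH_PREFIX environment variable come from the description above.

```python
import os

MODIFIABLE_LIBS_PREFIX = "MODIFIABLE_LIBS_PREFIX"  # first part: common name

def modifiable_space_prefix(test_part: str) -> str:
    # Second part: unique per machine, taken from an environment variable
    # (set for GitHub runners or individually by each person).
    machine_part = os.environ.get(
        "ARCTICDB_PERSISTENT_STORAGE_SHARED_PATH_PREFIX", "local")
    # Third part: defined in each test case.
    return "/".join([MODIFIABLE_LIBS_PREFIX, machine_part, test_part])

os.environ["ARCTICDB_PERSISTENT_STORAGE_SHARED_PATH_PREFIX"] = "runner-42"
print(modifiable_space_prefix("WIDE_TESTS"))
# → MODIFIABLE_LIBS_PREFIX/runner-42/WIDE_TESTS
```

For LMDB the resulting string can serve as a relative folder path instead of an object-key prefix.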

Test Space

This space is exactly like the Persistent Space. It is used to test the framework itself, to ensure that every operation working on the persistent space functions correctly. We cannot take the risk of executing unknown operations on the shared persistent space. Therefore, we need a safe space similar to the real one but not identical. Again, a shared space with a different prefix is needed, but this time, no concatenations are required to make it private.

Note that the above prefix abstraction works for LMDB too; there, the prefix maps to the folder structure and folder names.

EnvConfigurationBase class

This abstraction is the foundation for the abstract class EnvConfigurationBase, which hides all these complexities and provides an easy and safe way to access shared storages and write ASV tests that are safely isolated from each other.

For persistent space operations it provides the ASV test developer with the following methods:

  • get_library(<optional_suffix>) - Gets or creates a library (if it does not exist) on the shared space, using the name of the test with an optional suffix.
  • get_arctic_client_persistent()
  • For obvious reasons, no delete convenience methods are available.
  • set_test_mode() - A special method that turns the 'persistent' space into a test space, allowing all operations to be executed safely. Can and should be used for writing tests.
  • The following methods only make sense for persistent space and not for modifiable space:
      • setup_environment() - Provides an implementation that checks whether the necessary components are present and creates them if they are not. Useful for setting up the cache.
      • setup_all() - An abstract method called by setup_environment(), intended to be implemented by subclasses.
      • check_ok() - Checks whether the necessary components are present. Also used by setup_environment(); it is abstract and must be implemented.
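The relationship between setup_environment(), setup_all(), and check_ok() can be sketched as follows. This is a simplified stand-in, not the real EnvConfigurationBase; set_test_mode() is modelled as a plain flag, and the subclass is purely illustrative.

```python
from abc import ABC, abstractmethod

class EnvConfigurationSketch(ABC):
    """Simplified stand-in for the pattern described above."""

    def __init__(self):
        self.test_mode = False

    def set_test_mode(self):
        # Redirect the 'persistent' space to the test space so that
        # all operations can be executed safely.
        self.test_mode = True

    def setup_environment(self):
        # Create the necessary components only if they are missing;
        # on later runs this becomes a cheap no-op (useful for the cache).
        if not self.check_ok():
            self.setup_all()
        return self

    @abstractmethod
    def check_ok(self) -> bool:
        """Report whether the necessary components are present."""

    @abstractmethod
    def setup_all(self) -> None:
        """Create the components; called by setup_environment()."""

class OneLibrarySetup(EnvConfigurationSketch):
    def __init__(self):
        super().__init__()
        self.libraries = {}

    def check_ok(self) -> bool:
        return "my_lib" in self.libraries

    def setup_all(self) -> None:
        self.libraries["my_lib"] = ["sym_a", "sym_b"]

env = OneLibrarySetup().setup_environment()
print(env.check_ok())  # → True
```

The point of the split is that setup_environment() carries the reusable "check, then create" logic, while subclasses only describe what a correct environment looks like.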

For modifiable storage:

  • get_modifiable_library(<optional_suffix>) - Gets or creates a library with the name of the test, along with an optional suffix, on the modifiable space. To be used in the setup() method of an ASV test. As ASV runs tests in several processes, it is strongly recommended to pass a unique id, such as the process pid, as suffix - self.lib = get_modifiable_library(os.getpid())
  • get_arctic_client_modifiable()
  • remove_all_modifiable_libraries() - Defined only for modifiable libraries. Safely removes all libraries in the modifiable private space of the test. Can be used in the setup_cache() method of an ASV test.
  • delete_modifiable_library(<optional_suffix>) - To be used in teardown() methods - self.storage.delete_modifiable_library(os.getpid())

This design provides the foundation for creating concrete implementations for setting up different libraries and symbol setups for various needs.

Some setups can be reusable in the future for ASV or other tests. In those cases, it makes sense to provide concrete solutions (implementations of the base class).

However, some tests may have unique needs, so reusing any setup logic beyond the base framework for accessing shared persistent and modifiable storage space is not applicable. For these cases, a general-purpose, no-setup class is available - GeneralUseCaseNoSetup. It provides only the framework for accessing storage spaces safely, with no additional logic.

Utility Classes for setting up the environment

The EnvConfigurationBase base class also provides the ability to build specialized classes to facilitate specific setups on the persistent space of the storage, which will help run ASV tests for non-modifiable ArcticDB operations like read(), list_snapshots(), etc.

Why Classes Instead of Functions?

Let's take the SetupSingleLibrary class as an example. Its purpose is to provide a configurable setup for tests that require a structure where we have one library and symbols with a specified number of rows.

To build such a structure on specific storage, we need data on how many symbols to create and how many rows they should have. We can easily pass that information, but then we need to return a certain structure containing all the necessary details, such as the library name, symbol names, and their properties (in our case, rows). In general, this could also include the number of versions, snapshots, metadata, etc. As you can see, there is plenty of information that might be needed in the current and future cases.

It is unrealistic to assume that all tests will require the same data structure. It is better to anticipate that some tests will need tiny symbols, some wide ones, others with indexes, and some without indexes, etc.

After creating the structure, you will need to use it in your tests to iterate over it. So, how exactly should the returned structure look? A list? A dictionary? But wait, ASV actually works with lists of lists, while other usage scenarios might work differently.

If we select the approach of creating a function, it cannot satisfy all needs. Yes, it will look simple, but only because we are thinking of exactly one or two cases.

Therefore, we need to avoid restricting such logic and make it open and upgradable. The only way to achieve that is through specialized classes.

Yes, they require more effort and code, but they provide lots of flexibility and are open for extensions and different usage patterns.
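The "lists of lists" point can be illustrated with a toy class. The method names mirror the example below (set_params, get_parameter_list, get_symbol_name), but the implementation here is purely illustrative, not the real SetupSingleLibrary.

```python
from itertools import product

class ParamHolder:
    """Toy holder for ASV-style parameters (illustrative only)."""

    def set_params(self, params):
        self._params = params  # ASV expects a list of lists
        return self            # allow chaining, as in the example below

    def get_parameter_list(self):
        return self._params

    def get_symbol_name(self, num_rows, num_cols):
        # Deriving names from the parameters lets both the setup code
        # and the tests find the same symbol deterministically.
        return f"sym_{num_rows}_{num_cols}"

holder = ParamHolder().set_params([[2500, 3000], [15000, 30000]])
print(holder.get_parameter_list())
for rows, cols in product(*holder.get_parameter_list()):
    print(holder.get_symbol_name(rows, cols))
```

Because the class owns both the parameter list and the naming scheme, it can hand ASV the list-of-lists shape it expects while offering other consumers whatever view they need.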

How to Integrate This into ASV?

Here is an example with the SetupSingleLibrary utility class that will provide clarity on how to use it and the motivation behind the design.

class AWSWideDataFrameTests:

    rounds = 1
    number = 3  # the test runs 3 times between each setup-teardown
    repeat = 1  # defines the number of times the measurements will invoke setup-teardown
    min_run_count = 1
    warmup_time = 0

    timeout = 1200

    SETUP_CLASS = (SetupSingleLibrary(storage=Storage.AMAZON,  # must define the type of storage to be used
                                      # Define a UNIQUE STRING for persistent library names;
                                      # it will also be used to create a unique storage prefix
                                      # for the modifiable libraries of this test.
                                      prefix="WIDE_TESTS",
                                      # Library options can be passed via the constructor.
                                      library_options=LibraryOptions(rows_per_segment=1000,
                                                                     columns_per_segment=1000))
                   # As ASV uses parameters in a certain way, we pass the
                   # parameters to the class instance (and then use the class
                   # instance to pass the parameters back to ASV). The class must
                   # always assume that the passed parameters are ASV-style parameters.
                   .set_params([
                       [2500, 3000],
                       [15000, 30000]]))

    params = SETUP_CLASS.get_parameter_list()
    param_names = ["num_rows", "num_cols"] 

    def setup_cache(self):
        # In setup_cache() we invoke the persistent environment setup.
        # In our case SetupSingleLibrary implements the specific logic
        # needed by check_ok() to report whether the setup is ok
        # and by setup_all() to do the initial setup if it is not.
        setup_env = AWSWideDataFrameTests.SETUP_CLASS.setup_environment()

        # Because ASV will run tests in other processes, and those
        # processes will have their own instance of SetupSingleLibrary,
        # the only way to synchronize the processes is to serialize
        # the class info and then, in setup(), create a new instance
        # that has the needed information.
        info = setup_env.get_storage_info()
        setup_env.logger().info(f"storage info object: {info}")

        # If we have a test with a modifiable lib, we can safely wipe out all of them.
        setup_env.remove_all_modifiable_libraries()

        # This is the only way to pass the info from the setup process to the other processes.
        return info

    def setup(self, storage_info, num_rows, num_cols):
        # After creating the process-local instance with the data from the
        # setup process instance, we can be sure that all is synchronized.
        self.storage = SetupSingleLibrary.from_storage_info(storage_info)

        # As the specialized class works on persistent storage, we obtain
        # the library where the environment is set up.
        self.lib = self.storage.get_library()

        # Writing into a library whose suffix is the process id
        # protects ASV processes from writing to one and the same symbol;
        # this way each process gets its own unique library.
        self.write_library = self.storage.get_modifiable_library(os.getpid())
		

    def teardown(self, storage_info, num_rows, num_cols):
        # Delete our modifiable library.
        self.storage.delete_modifiable_library(os.getpid())
		
		
    def time_read_wide(self, storage_info, num_rows, num_cols):
        # The easiest way to get access to a symbol is to use the
        # exact parameters used for its creation; therefore the utility
        # class provides functions that are used for symbol creation
        # and later, in the test, for accessing it.
        sym = self.storage.get_symbol_name(num_rows, num_cols)
        .....
        # In a similar fashion, utility classes may override the library
        # names if the setup they create requires more libraries.