NEWS
Version 0.4.1
- As of QUDA 0.4.0, support has been dropped for the very first
generation of CUDA-capable devices (implementing "compute
capability" 1.0). These include the Tesla C870, the Quadro FX 5600
and 4600, and the GeForce 8800 GTX.
- Fixed a typo that prevented domain_wall_dslash_test from compiling
(and thus subsequent tests unless domain wall was disabled).
- Added a new interface function, setVerbosityQuda(), to allow for
finer-grained control of status reporting. See the description in
include/quda.h for usage information.
- Merged wilson_dslash_test and domain_wall_dslash_test together into
a unified dslash_test, and likewise for invert_test. The staggered
tests are still separate for now.
Version 0.4.0 - 4 April 2012
- CUDA 4.0 or later is now required to build the library.
- The "make.inc.example" template has been replaced by a configure script.
See the README file for build instructions and "configure --help" for
a list of configure options.
- Emulation mode is no longer supported.
- Added support for using multiple GPUs in parallel via MPI or QMP.
This is supported by all solvers for the Wilson, clover-improved
Wilson, twisted mass, and improved staggered fermion actions.
Multi-GPU support for domain wall is planned for a future
release.
- Reworked auto-tuning so that BLAS kernels are tuned at runtime,
Dirac operators are also tuned, and tuned parameters may be cached
to disk between runs. Tuning is enabled via the "tune" member of
QudaInvertParam and is essential for achieving optimal performance
in the solvers. See the README file for details on enabling
caching, which avoids the overhead of tuning for all but the first
run at a given set of parameters (action, precision, lattice volume,
etc.).
- Added NUMA affinity support. Given a sufficiently recent Linux
kernel and a system with dual I/O hubs (IOHs), QUDA will attempt to
associate each GPU with the "closest" socket. This feature is
disabled by default under OS X and may be disabled under Linux via
the "--disable-numa-affinity" configure flag.
- Improved stability on Fermi-based GeForce cards by disabling double
precision texture reads. These may be re-enabled on Fermi-based
Tesla cards for improved performance, as described in the README
file.
- Added command-line options for most of the tests. See, e.g.,
"wilson_dslash_test --help".
- Added CPU reference implementations of all BLAS routines, which allows
tests/blas_test to check for correctness.
- Implemented various structural and performance improvements
throughout the library.
- Deprecated the QUDA_VERSION macro (which corresponds to an integer
in octal). Please use QUDA_VERSION_MAJOR, QUDA_VERSION_MINOR, and
QUDA_VERSION_SUBMINOR instead.
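The component macros named above make version checks straightforward,
because each one is an ordinary small integer. The following is a minimal
sketch of such a check, not code taken from QUDA. The stand-in macro values
exist only so the example compiles without quda.h, and the helper names
version_at_least() and quda_version_at_least() are hypothetical.

```c
#include <assert.h>

/* Stand-in definitions so the sketch compiles without quda.h. The real
   header supplies these macros; the values here are only illustrative. */
#ifndef QUDA_VERSION_MAJOR
#define QUDA_VERSION_MAJOR    0
#define QUDA_VERSION_MINOR    4
#define QUDA_VERSION_SUBMINOR 1
#endif

/* Hypothetical helper: returns 1 if (major, minor, subminor) is at least
   (req_major, req_minor, req_subminor), comparing component by component. */
static int version_at_least(int major, int minor, int subminor,
                            int req_major, int req_minor, int req_subminor)
{
  assert(major >= 0 && minor >= 0 && subminor >= 0);
  if (major != req_major) return major > req_major;
  if (minor != req_minor) return minor > req_minor;
  return subminor >= req_subminor;
}

/* Hypothetical convenience wrapper around the component macros. */
static int quda_version_at_least(int req_major, int req_minor, int req_subminor)
{
  return version_at_least(QUDA_VERSION_MAJOR, QUDA_VERSION_MINOR,
                          QUDA_VERSION_SUBMINOR,
                          req_major, req_minor, req_subminor);
}
```

A caller could, for instance, guard use of a newer feature with
quda_version_at_least(0, 4, 0). Comparing the components one at a time
avoids depending on how a combined version integer is written.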
Version 0.3.2 - 18 January 2011
- Fixed a regression in 0.3.1 that prevented the BiCGstab solver from
working correctly with half precision on Fermi.
Version 0.3.1 - 22 December 2010
- Added support for domain wall fermions. The length of the fifth
dimension and the domain wall height are set via the 'Ls' and 'm5'
members of QudaInvertParam. Note that the convention is to include
the minus sign in m5 (e.g., m5 = -1.8 would be a typical value).
- Added support for twisted mass fermions. The twisted mass parameter
and flavor are set via the 'mu' and 'twist_flavor' members of
QudaInvertParam. Similar to clover fermions, both symmetric and
asymmetric even/odd preconditioning are supported. The symmetric
case is better optimized and generally also exhibits faster
convergence.
- Improved performance in several of the BLAS routines, particularly
on Fermi.
- Improved performance in the CG solver for Wilson-like (and domain
wall) fermions by avoiding unnecessary allocation and deallocation
of temporaries, at the expense of increased memory usage. This will
be improved in a future release.
- Enabled optional building of Dirac operators, set in make.inc, to
keep build time in check.
- Added declaration for MatDagMatQuda() to the quda.h header file and
removed the non-existent functions MatPCQuda() and
MatPCDagMatPCQuda(). The latter two functions have been absorbed
into MatQuda() and MatDagMatQuda(), respectively, since
preconditioning may be selected via the solution_type member of
QudaInvertParam.
- Fixed a bug in the Wilson and Wilson-clover Dirac operators that
prevented the use of MatPC solution types.
- Fixed a bug in the Wilson and Wilson-clover Dirac operators that
would cause a crash when QUDA_MASS_NORMALIZATION is used.
- Fixed an allocation bug in the Wilson and Wilson-clover
Dirac operators that might have led to undefined behavior for
non-zero padding.
- Fixed a bug in blas_test that might have led to incorrect autotuning
for the copyCuda() routine.
- Various internal changes: removed temporary cudaColorSpinorField
argument to solver functions; modified BLAS functions to use C++
complex<double> type instead of cuDoubleComplex type; improved code
hygiene by ensuring that all textures are bound in dslash_quda.cu
and unbound after kernel execution; etc.
Version 0.3.0 - 1 October 2010
- CUDA 3.0 or later is now required to build the library.
- Several changes have been made to the interface that require setting
new parameters in QudaInvertParam and QudaGaugeParam. See below for
details.
- The internals of QUDA have been significantly restructured to facilitate
future extensions. This is an ongoing process and will continue
through the next several releases.
- The inverters might require more device memory than they did before.
This will be corrected in a future release.
- The CG inverter now supports improved staggered fermions (asqtad or
HISQ). Code has also been added for asqtad link fattening, the asqtad
fermion force, and the one-loop improved Symanzik gauge force, but
these are not yet exposed through the interface in a consistent way.
- A multi-shift CG solver for improved staggered fermions has been
added, callable via invertMultiShiftQuda(). This function does not
yet support Wilson or Wilson-clover.
- It is no longer possible to mix different precisions for the
spinors, gauge field, and clover term (where applicable). In other
words, it is required that the 'cuda_prec' member of QudaGaugeParam
match both the 'cuda_prec' and 'clover_cuda_prec' members of
QudaInvertParam, and likewise for the "sloppy" variants. This
change has greatly reduced the time and memory required to build the
library.
- Added 'solve_type' to QudaInvertParam. This determines how the linear
system is solved, in contrast to solution_type which determines what
system is being solved. When using the CG inverter, solve_type should
generally be set to 'QUDA_NORMEQ_PC_SOLVE', which will solve the
even/odd-preconditioned normal equations via CGNR. (The full
solution will be reconstructed if necessary based on solution_type.)
For BiCGstab, 'QUDA_DIRECT_PC_SOLVE' is generally best. These choices
correspond to what was done by default in earlier versions of QUDA.
- Added 'dagger' option to QudaInvertParam. If 'dagger' is set to
QUDA_DAG_YES, then the matrices appearing in the chosen solution_type
will be conjugated when determining the system to be solved by
invertQuda() or invertMultiShiftQuda(). This option must also be set
(typically to QUDA_DAG_NO) before calling dslashQuda(), MatPCQuda(),
MatPCDagMatPCQuda(), or MatQuda().
- Eliminated 'dagger' argument to dslashQuda(), MatPCQuda(), and MatQuda()
in favor of the new 'dagger' member of QudaInvertParam described above.
- Removed the unused blockDim and blockDim_sloppy members from
QudaInvertParam.
- Added 'type' parameter to QudaGaugeParam. For Wilson or Wilson-clover,
this should be set to QUDA_WILSON_LINKS.
- The dslashQuda() function now takes an argument of type
QudaParityType to determine the parity (even or odd) of the output
spinor. This was previously specified by an integer.
- Added support for loading all elements of the gauge field matrices,
without SU(3) reconstruction. Set the 'reconstruct' member of
QudaGaugeParam to 'RECONSTRUCT_NO' to select this option, but note
that it should not be combined with half precision unless the
elements of the gauge matrices are bounded by 1. This restriction
will be removed in a future release.
- Renamed dslash_test to wilson_dslash_test, renamed invert_test to
wilson_invert_test, and added staggered variants of these test
programs.
- Improved performance of the half-precision Wilson Dslash.
- Temporarily removed 3D Wilson Dslash.
- Added an 'OS' option to make.inc.example, to simplify compiling for
Mac OS X.
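The precision-matching rule described in the notes for this release can be
checked before calling into the library. The following is a minimal sketch
under stated assumptions, not part of QUDA. The Ex* types are stand-ins
that model only the relevant members so the example compiles without
quda.h, the enumerator names and the *_sloppy member names are assumed for
illustration, and precisions_consistent() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types so the sketch compiles without quda.h. Only the members
   relevant to the precision rule are modeled; the enumerator names and the
   *_sloppy member names are assumptions made for illustration. */
typedef enum { EX_HALF, EX_SINGLE, EX_DOUBLE } ExPrecision;

typedef struct {
  ExPrecision cuda_prec;
  ExPrecision cuda_prec_sloppy;
} ExGaugeParam;

typedef struct {
  ExPrecision cuda_prec;
  ExPrecision cuda_prec_sloppy;
  ExPrecision clover_cuda_prec;
  ExPrecision clover_cuda_prec_sloppy;
} ExInvertParam;

/* Hypothetical helper: returns 1 if the gauge, spinor, and (when
   uses_clover is nonzero) clover precisions agree, for both the regular
   and the sloppy settings; returns 0 otherwise. */
static int precisions_consistent(const ExGaugeParam *g,
                                 const ExInvertParam *inv,
                                 int uses_clover)
{
  assert(g != NULL && inv != NULL);

  /* Spinor precision must match the gauge field precision. */
  if (g->cuda_prec != inv->cuda_prec) return 0;
  if (g->cuda_prec_sloppy != inv->cuda_prec_sloppy) return 0;

  /* The clover term is only checked where applicable. */
  if (uses_clover) {
    if (g->cuda_prec != inv->clover_cuda_prec) return 0;
    if (g->cuda_prec_sloppy != inv->clover_cuda_prec_sloppy) return 0;
  }
  return 1;
}
```

With real QUDA structures, the same comparisons would be made on
QudaGaugeParam and QudaInvertParam directly; the member names should be
confirmed against include/quda.h for the release in use.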
Version 0.2.5 - 24 June 2010
- Fixed regression in 0.2.4 that prevented the library from compiling
when GPU_ARCH was set to sm_10, sm_11, or sm_12.
Version 0.2.4 - 22 June 2010
- Added initial support for CUDA 3.x and Fermi (not yet optimized).
- Incorporated look-ahead strategy to increase stability of the BiCGstab
inverter.
- Added definition of QUDA_VERSION to quda.h. This is an integer with
two digits for each of the major, minor, and subminor version
numbers. For example, QUDA_VERSION is 000204 for this release.
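One consequence of writing the version with leading zeros is worth
spelling out: in C, an integer constant that begins with 0 is parsed as
octal, so 000204 has the decimal value 132 rather than 204. The sketch
below illustrates this and shows one possible way to recover the
components. It is not code from QUDA: EXAMPLE_QUDA_VERSION is a stand-in
for the real macro, and unpack_octal_version() is a hypothetical helper
that only works while every digit of the version is between 0 and 7.

```c
#include <assert.h>

/* Stand-in for the definition described above; the real macro lives in
   quda.h. A different name is used to avoid clashing with that header. */
#define EXAMPLE_QUDA_VERSION 000204

/* Hypothetical helper: unpack a version written with two digits per
   component and stored as an octal literal. Each octal digit occupies
   3 bits, so each two-digit component occupies 6 bits. */
static void unpack_octal_version(int v, int *major, int *minor, int *subminor)
{
  int pair[3];
  assert(v >= 0);
  for (int i = 0; i < 3; i++) {
    int low  = (v >> (6 * i)) & 7;      /* units digit of the pair */
    int high = (v >> (6 * i + 3)) & 7;  /* tens digit of the pair */
    pair[i] = 10 * high + low;
  }
  *subminor = pair[0];
  *minor    = pair[1];
  *major    = pair[2];
}
```

Calling unpack_octal_version(EXAMPLE_QUDA_VERSION, &major, &minor,
&subminor) yields 0, 2, and 4. A component containing the digit 8 or 9
could not be written this way at all, since such a literal is not valid
octal.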
Version 0.2.3 - 2 June 2010
- Further improved performance of the BLAS routines.
- Added 3D Wilson Dslash in anticipation of temporal preconditioning.
Version 0.2.2 - 16 February 2010
- Fixed a bug that prevented reductions (and hence the inverter) from working
correctly in emulation mode.
Version 0.2.1 - 8 February 2010
- Fixed a bug that would sometimes cause the inverter to fail when spinor
padding is enabled.
- Significantly improved performance of the BLAS routines.
Version 0.2 - 16 December 2009
- Introduced new interface functions newQudaGaugeParam() and
newQudaInvertParam() to allow for enhanced error checking. See
invert_test for an example of their use.
- Added auto-tuning of BLAS routines to improve performance (see README for details).
- Improved stability of the half precision 8-parameter SU(3)
reconstruction (with thanks to Guochun Shi).
- Cleaned up the invert_test example to remove unnecessary dependencies.
- Fixed bug affecting saveGaugeQuda() that caused su3_test to fail.
- Tuned parameters to improve performance of the half-precision clover
Dslash on sm_13 hardware.
- Formally adopted the MIT/X11 license.
Version 0.1 - 17 November 2009
- Initial public release.