
How does one determine the Memory Mapping and VRAM Temperature Reading? #20

Open
jjziets opened this issue Nov 14, 2023 · 12 comments

Comments

@jjziets

jjziets commented Nov 14, 2023

Hi. I have an A5000 and a 5000 Ada; how would one get the offset?

@olealgoritme
Owner

olealgoritme commented Nov 15, 2023

A4500 and A6000 have been tested. Try out those offsets and let us know 😺
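
For illustration, a hypothetical dev_table entry for the A5000 that reuses the 0x0000E2A8 offset the tested GA102 boards (e.g. the A4500) use; the 0x2231 device ID is an assumption here and should be verified with lspci -nn before relying on it:

{ .offset = 0x0000E2A8, .dev_id = 0x2231, .vram = "GDDR6",  .arch = "GA102", .name =  "RTX A5000" },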

@Jbkcrash2

I have the same question for the 3060. I added a device line that I assume is close. At least the card is recognized now. However, I must have the offset incorrect as I always get 127C, which I surely hope isn't correct! :)

Here is the device table entry I added...

{ .offset = 0x0000E2A8, .dev_id = 0x228E, .vram = "GDDR6",  .arch = "GA106", .name =  "RTX 3060" },

Here is the output for my test setup...

jkinney@largeicus:~/workspace/gddr6$ sudo ./gddr6
Device: RTX 3060 GDDR6 (GA106 / 0x228e) pci=84:0:1
Device: RTX 3060 GDDR6 (GA106 / 0x228e) pci=3:0:1
Device: RTX 3060 GDDR6 (GA106 / 0x228e) pci=4:0:1
VRAM Temps: | 127°C | 127°C | 127°C |
VRAM Temps: | 127°C | 127°C | 127°C |
VRAM Temps: | 127°C | 127°C | 127°C |
VRAM Temps: | 127°C | 127°C | 127°C |
VRAM Temps: | 127°C | 127°C | 127°C |

Any help would be greatly appreciated.

@olealgoritme
Owner

127°C is definitely incorrect. I do not have access to an RTX 3060, sadly. Did you try the 0x0000EE50 offset which has been tested to work on some RTX 3070s?

@Jbkcrash2

Thank you for the quick response. It was worth a shot, but unfortunately I get the same 127°C results.

Is there something I can scan or run to determine the value? (A rough scanning sketch follows the output below.) I would be more than willing to put some effort into this. I also want to add the T4 (GDDR6) next if we can.

jkinney@largeicus:~/workspace/gddr6$ git diff
diff --git a/gddr6.c b/gddr6.c
index 1e5d21d..0e04762 100644
--- a/gddr6.c
+++ b/gddr6.c
@@ -52,6 +52,7 @@ struct device dev_table[] =
     { .offset = 0x0000E2A8, .dev_id = 0x2216, .vram = "GDDR6X", .arch = "GA102", .name =  "RTX 3080 LHR" },
     { .offset = 0x0000EE50, .dev_id = 0x2484, .vram = "GDDR6",  .arch = "GA104", .name =  "RTX 3070" },
     { .offset = 0x0000EE50, .dev_id = 0x2488, .vram = "GDDR6",  .arch = "GA104", .name =  "RTX 3070 LHR" },
+    { .offset = 0x0000EE50, .dev_id = 0x228E, .vram = "GDDR6",  .arch = "GA106", .name =  "RTX 3060" },
     { .offset = 0x0000E2A8, .dev_id = 0x2531, .vram = "GDDR6",  .arch = "GA106", .name =  "RTX A2000" },
     { .offset = 0x0000E2A8, .dev_id = 0x2571, .vram = "GDDR6",  .arch = "GA106", .name =  "RTX A2000" },
     { .offset = 0x0000E2A8, .dev_id = 0x2232, .vram = "GDDR6",  .arch = "GA102", .name =  "RTX A4500" },
jkinney@largeicus:~/workspace/gddr6$ make clean; make; sudo ./gddr6
rm -f gddr6
gcc -std=c11 -O3 -Wall -Werror -Wextra -Wshadow -Wpointer-arith -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -Wvla -o gddr6 gddr6.c -lpci
Device: RTX 3060 GDDR6 (GA106 / 0x228e) pci=84:0:1
Device: RTX 3060 GDDR6 (GA106 / 0x228e) pci=3:0:1
Device: RTX 3060 GDDR6 (GA106 / 0x228e) pci=4:0:1
VRAM Temps: | 127°C | 127°C | 127°C |
VRAM Temps: | 127°C | 127°C | 127°C |
VRAM Temps: | 127°C | 127°C | 127°C |
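
For anyone who wants to hunt for an offset themselves, here is a rough, untested scanning sketch (not part of the project). It mmaps the GPU's BAR0 through the PCI resource0 file and walks a window of 32-bit registers, flagging any that decode to a plausible temperature. The /sys path, the scan window, and the (reg & 0xFFF) / 0x20 decoding (assumed here to mirror what gddr6.c does) are all assumptions to adjust; treat any hit as a candidate to verify, not a confirmed sensor.

/* offset_scan.c - rough sketch for hunting a VRAM temperature register.
 * Assumptions (not verified): BAR0 is readable via the PCI resource0 file,
 * and temperatures decode as (reg & 0xFFF) / 0x20, mirroring gddr6.c.
 * Build: gcc -std=c11 -O2 -o offset_scan offset_scan.c
 * Run:   sudo ./offset_scan /sys/bus/pci/devices/0000:03:00.0/resource0
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define SCAN_START 0x0000E000u  /* window around the known 0xE2A8 / 0xEE50 offsets */
#define SCAN_END   0x0000F000u  /* page-aligned end of the scan window */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s /sys/bus/pci/devices/<BDF>/resource0\n", argv[0]);
        return EXIT_FAILURE;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Map BAR0 from its start up to the end of the scan window. */
    uint8_t *bar0 = mmap(NULL, SCAN_END, PROT_READ, MAP_SHARED, fd, 0);
    if (bar0 == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return EXIT_FAILURE;
    }

    for (uint32_t off = SCAN_START; off < SCAN_END; off += 4) {
        uint32_t reg  = *(volatile uint32_t *)(bar0 + off);
        uint32_t temp = (reg & 0xFFFu) / 0x20u;  /* assumed decoding, as in gddr6.c */

        /* Only report values in a believable range; skips the bogus 127 reading above. */
        if (temp >= 25 && temp <= 110)
            printf("candidate offset 0x%08X: raw 0x%08X -> %u C\n", off, reg, temp);
    }

    munmap(bar0, SCAN_END);
    close(fd);
    return EXIT_SUCCESS;
}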

@weter11

weter11 commented Jan 15, 2024

As far as I know, only a few desktop RTX 3070s have memory temperature sensors. Anything below that (including the RTX 3060 and 4060, desktop and mobile, Ti variants included) never had them; even hot-spot temperature measurement isn't implemented on some RTX 3060 chips.
So mobile RTX 30xx cards don't have such sensors in any notebook (including my RTX 3070, which reports no hot-spot or memory info). The RTX 4070 and up (mobile and desktop) do have them.

@Jbkcrash2

@weter11, that is a perfect point. So far, I have yet to find the address; it is highly possible it doesn't exist. I am fairly sure the cards aren't frying the VRAM, even during fine-tuning, but I want to do a copper shim mod and get some measurements before and after. I may be stuck with winging it! ;)

@weter11

weter11 commented Jan 16, 2024

Technically there are plenty of ways to measure temperature, from a thermocouple connected to a multimeter or an Arduino up to some cheap USB temperature controller from China; I don't see any problem there, just fix the thermocouple with a bit of glue or tape for stable measurements. Also, is the overheating really that severe? The first real problems started when Nvidia began using GDDR6X, which needs far more power than GDDR6 (1.5-2x), and users of some cards with thick thermal pads (3 mm and up) saw very high memory temperatures and slowdowns in games and tests (the result of poor engineering or testing).
PS: On the T4 the pads really should be replaced; their lifespan will be over in 1-2 years, and the maximum working temperature for GDDR6 is 85°C for Samsung modules and 95°C for Micron.
PPS: Apologies to the devs, this post wasn't about the software, just some technical aspects of GDDR6.

@weter11

weter11 commented Jan 16, 2024

I found new information: by the standard, every GDDR6 module should have a temperature sensor to prevent self-damage, and the datasheets for Micron and Samsung modules do list such sensors.
My RTX 3070M contains the same Samsung modules as the desktop cards, so either we need another offset, or there is simply no I2C link between the modules and the GPU, the Nvidia drivers have no way to read values from the modules, and the modules regulate themselves without any external signal.
PS: My previous information was based on HWiNFO and GPU-Z data and their devs' reports; they still haven't implemented GDDR6 memory temperature reading, so this reader is the first!

@Jbkcrash2

Jbkcrash2 commented Jan 17, 2024

@weter11 again, several good points, and yes, truly sorry that we are sidetracking this issue; still, this conversation is on point for the project as a whole.

I have contemplated a direct measure, as you mentioned. However, I don't feel that it genuinely tells me much about before and after, as it would clearly measure the benefit of the shim and its interface to the heatsink, but not necessarily the quality of the interface to the chip itself. I am still on the fence on this one.

Based on research, I agree with a possible poor selection of GDDR6X, but I don't have the details on timing, sourcing, etc. I feel they were about the speed and the marketability that "faster" has. With the commodity nature of these cards and my unintended use of them outside of gaming, mine is just an opinion. As we know, Nvidia isn't ecstatic about us using the 30XX cards this way, and maybe the gaming(speed) benefits outweigh the increased overall power and heat.

That being said, thanks for the heads-up on the T4s. I have not repasted/re-padded them yet. They are "refurbished" cards in a proper server (R740) at our cooled data center; I might have made an inappropriate assumption about "refurb." I did repaste and re-pad the 3060s from eBay. They were probably old mining cards and really needed the repaste work; everything had dried out. Since they are stuffed back to back in an HP Z840 and are not in the best deployment to stay cool, I am most concerned about them.

I may have to pull one of them out and stuff it into a Windows desktop to see if the temps show. That would at least guarantee we could solve this with the right offset.

@olealgoritme
Owner

Please continue longer conversations elsewhere, e.g. in GPU Unlocking @ Discord: https://discord.com/invite/JrSHQzte

@jjziets
Author

jjziets commented Jan 30, 2024

My A6000 doesn't have device ID 0x26B1; it's 0x2230.

@olealgoritme
Owner

Just add a table entry with 0x2230 and the same offset, and give it a try. If it works, send a PR with the changes. Thank you.
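
For illustration, a hypothetical entry along those lines; 0x0000E2A8 is borrowed from the other GA102 boards already in the table (e.g. the A4500), so confirm it matches the offset on the existing 0x26B1 line before opening a PR:

{ .offset = 0x0000E2A8, .dev_id = 0x2230, .vram = "GDDR6",  .arch = "GA102", .name =  "RTX A6000" },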
