Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

killing a process not always remove all mnesia transaction locks causing few orphaned locks held for forever #9581

Open
amitabhadatta14 opened this issue Mar 12, 2025 · 5 comments
Assignees
Labels
bug Issue is reported as a bug team:PS Assigned to OTP team PS

Comments

@amitabhadatta14
Copy link

amitabhadatta14 commented Mar 12, 2025

When killing multiple processes by a timer:kill_after(Ttimeout, kill) and the each process consist of mnesia transaction of a clustered database tables with write lock doesn't always release the locks upon getting EXIT signal from the Pid being killed itself. Some held locks are seen hung around from clustered node (peer) not the node where the process being killed with Pid starting non-zero number.

To reproduce, we need a at least two nodes mnesia cluster database table and from one node, create a process which will write data to a clustered table within a transaction. before the transaction statement, start a timer:exit_after(Timeout, kill). Set the timeout value to 1ms.

spawn this process inside a loop for 1000 times.

example

[my_proc() || _C <- lists:seq(1,1000)].

my_proc() ->
    Pid = self(),
    Timeout = 5, %% 5 ms
    process_flag(trap_exit, true),
    RPid = spawn_link(fun() -> Res = write_to_db(Key, Timeout),
                                                    Pid ! {results, Res}
                                     end),
     receive
             {results, Result} -> Result;
             {'EXIT', _, _} ->
                                        io:format("Processs has been killed. ~p~n",[RPid]),
                                        exit(RPid, timeout)
     end. 

write_to_db(Key, Timeout) ->
   {ok, Tref} =  timer:exit_after(Timeout, kill),
    MyFun = fun(Key) ->
              Data =  mnesia:read(Table1, Key, write),
               timer:sleep(10),
               mnesia:write(Table1, Data, write),
               timer:cancel(Tref) end,
   mnesia:activity(transaction, MyFun,
                           [Key], mnesia_frag).


After the execution, some held locks are seen hung for forever.

Expected behavior

All locks should have been release upon killing the process which consists of the mnesia transactions.

Affected versions
Tested from OTP 25 onwards - same results

@amitabhadatta14 amitabhadatta14 added the bug Issue is reported as a bug label Mar 12, 2025
@amitabhadatta14 amitabhadatta14 changed the title killing a process not always doesn't remove mnesia transaction locks causing orphaned held locks killing a process not always remove all mnesia transaction locks causing few orphaned locks helds forever Mar 12, 2025
@amitabhadatta14 amitabhadatta14 changed the title killing a process not always remove all mnesia transaction locks causing few orphaned locks helds forever killing a process not always remove all mnesia transaction locks causing few orphaned locks held for forever Mar 12, 2025
@IngelaAndin IngelaAndin added the team:PS Assigned to OTP team PS label Mar 14, 2025
@Mikaka27
Copy link
Contributor

Hi, I've added a testcase to otp based on your example, but didn't managed to reproduce it yet (probably wasn't lucky enough).
master...Mikaka27:otp:michal/mnesia/locks-leak-when-process-killed-during-transaction/OTP-19557

Please check if that's what you meant to test.
In the meantime, could you share the result of your reproduction? Which locks are stuck held forever?

@amitabhadatta14
Copy link
Author

amitabhadatta14 commented Mar 25, 2025

HI,

I have just re-produced the issue.

To re-produce the issue. you need the following structure of nested mnesia transaction

timer:exit_after(10, timeout),
transaction 1 {
                              Rec = mnesia:read(tab1, Key, write),
                              timer:sleep(15)
                              transaction 2 {
                                                          mnesia:write(tab1, rec, write)
                                                       } 
                        }

(4n1@icap600cx)23> rpc:multicall(mnesia,system_info,[held_locks]).
{[[],[],[],
[{{funds_DR_1_frag11,{"000000002A1","USD"}},
write,
{tid,6275,<8110.14752.1>}},
{{funds_DR_2,{"000000004A1","USD"}},
write,
{tid,8999,<8110.17444.1>}}],
[]],
[]}

logs from Node 3 which was processing the transaction and killed by timer

$ grep -r "<0.17444.1>"
4n1-0.1/log/erlang.log.1:Mnesia('4n1@icap600cx'): Transaction {tid,6468,<0.17444.1>} calling #Fun<flow_worker.0.42743720> with [] failed:
4n1-0.1/logs/audit/fps.open:25 Mar 2025-00:06:50.726700,INFO,1,IPSS0275,4n1@icap600cx,Mnesia info: Transaction {tid,6468,<0.17444.1>} calling #Fun<flow_worker.0.42743720> with [] failed:
4n3-0.1/log/erlang.log.1:Mnesia('4n3@icap600cx'): Restarting transaction {tid,9000,<0.17445.1>}: in 4ms {cyclic,'4n3@icap600cx',{funds_DR_2,{[48,48,48,48,48,48,48,48,52,65,49],[85,83,68]}},read,write,{tid,8999,<0.17444.1>}}
4n3-0.1/log/erlang.log.1:Mnesia('4n3@icap600cx'): Coordinator <0.17444.1> in transaction {tid,8999,
4n3-0.1/log/erlang.log.1: <0.17444.1>} died.
4n3-0.1/log/erlang.log.1:Mnesia('4n3@icap600cx'): Restarting transaction {tid,9022,<0.17486.1>}: in 7ms {cyclic,'4n4@icap600cx',{funds_DR_2,{[48,48,48,48,48,48,48,48,52,65,49],[85,83,68]}},write,write,{tid,8999,<0.17444.1>}}
4n3-0.1/log/erlang.log.1:Mnesia('4n3@icap600cx'): Restarting transaction {tid,9026,<0.17492.1>}: in 4ms {cyclic,'4n4@icap600cx',{funds_DR_2,{[48,48,48,48,48,48,48,48,52,65,49],[85,83,68]}},write,write,{tid,8999,<0.17444.1>}}
4n3-0.1/log/erlang.log.1:Mnesia('4n3@icap600cx'): Restarting transaction {tid,9038,<0.17510.1>}: in 4ms {cyclic,'4n4@icap600cx',{funds_DR_2,{[48,48,48,48,48,48,48,48,52,65,49],[85,83,68]}},write,write,{tid,8999,<0.17444.1>}}
4n3-0.1/log/erlang.log.1:Mnesia('4n3@icap600cx'): Restarting transaction {tid,9044,<0.17521.1>}: in 3ms {cyclic,'4n4@icap600cx',{funds_DR_2,{[48,48,48,48,48,48,48,48,52,65,49],[85,83,68]}},write,write,{tid,8999,<0.17444.1>}}
4n3-0.1/log/erlang.log.1:Mnesia('4n3@icap600cx'): Restarting transaction {tid,9095,<0.17635.1>}: in 8ms {cyclic,'4n4@icap600cx',{funds_DR_2,{[48,48,48,48,48,48,48,48,52,65,49],[85,83,68]}},write,write,{tid,8999,<0.17444.1>}}

logs from clustered node 4. There are thousands of times the transactions are restarts because of the orphaned lock that has been leftover by one of the killed process.

$ grep -r "<8110.17444.1>"
4n1-0.1/log/erlang.log.1:Killed Pid from check_funds_in_remote_site -> <8110.17444.1> on : '4n3@icap600cx'
4n1-0.1/log/erlang.log.1:check_funds_in_remote_site exiting at : "2025-03-25 00:06:45.565619000", Node : '4n3@icap600cx', txid : "20250325000000004A1B001000000001097", Pid : <8110.17444.1>
4n4-0.1/log/erlang.log.1:Mnesia('4n4@icap600cx'): Restarting transaction {tid,9039,<0.15770.1>}: in 3ms {cyclic,'4n4@icap600cx',{funds_DR_2,{[48,48,48,48,48,48,48,48,52,65,49],[85,83,68]}},read,write,{tid,8999,<8110.17444.1>}}
4n4-0.1/log/erlang.log.1:Mnesia('4n4@icap600cx'): Restarting transaction {tid,9048,<0.15774.1>}: in 9ms {cyclic,'4n4@icap600cx',{funds_DR_2,{[48,48,48,48,48,48,48,48,52,65,49],[85,83,68]}},read,write,{tid,8999,<8110.17444.1>}}
4n4-0.1/log/erlang.log.1:Mnesia('4n4@icap600cx'): Restarting transaction {tid,9051,<0.15777.1>}: in 5ms {cyclic,'4n4@icap600cx',{funds_DR_2,{[48,48,48,48,48,48,48,48,52,65,49],[85,83,68]}},read,write,{tid,8999,<8110.17444.1>}}
4n4-0.1/log/erlang.log.1:Mnesia('4n4@icap600cx'): Restarting transaction {tid,9057,<0.15780.1>}: in 7ms {cyclic,'4n4@icap600cx',{funds_DR_2,{[48,48,48,48,48,48,48,48,52,65,49],[85,83,68]}},read,write,{tid,8999,<8110.17444.1>}}

let me know anything I can be any help of you.

@amitabhadatta14
Copy link
Author

accidentally closed. Reopened.

@amitabhadatta14
Copy link
Author

amitabhadatta14 commented Mar 25, 2025

You need 4 nodes N1, N2 -> site1 and N3,N4 -> site2

test_table -> N1, N2

test_table -> N3, N4

new code ->

-module(test_locks).


%% Table : table1 ( node1, node2) -> site1, (node3, node4) -> site2
%%
%%-record(table1,
%%        {key :: string() | ms(),
%%         amount :: integer()  | ms()
%%        }).
%%
%%

-export([start_test/0,
        write_to_db/4,
        add_debit/2]).

start_test() ->
  _ =  [my_proc() || _C <- lists:seq(1,100000)].

my_proc() ->
    Pid = self(),
    Participant = "000000002A1",
    Amount = 999,
    Timeout = 1, %% 1 ms
    process_flag(trap_exit, true),
    RPid = spawn(fun() -> Res = write_to_db(Txid, Participant, Amount, Currcode, SettCycle, Timeout, Pid),
                                                    Pid ! {results, Res}
                                     end),
     receive
             {results, Result} -> io:format("Result : ~p~n",[Result]),
                                  Result;
             {'EXIT', _, _} ->
                                 io:format("Processs has been killed. ~p~n",[RPid]),
                                 exit(RPid, kill);
             V -> io:format("Recv ..... ~p~n",[V])
     end.

write_to_db(Participant, Amount, Timeout, Pid) ->
    erlang:send_after(Timeout, Pid, {'EXIT', self(), timeout}),
    LFun =
      fun() ->
         mnesia:read(table1, Participant, write),
         RemoteNodes = get_remote_nodes(),
         rpc_to_remote_notes(?MODULE, add_debit, [Participant, Amount], RemoteNodes)
      end,

add_debit(Participant, Amount) ->
    TFun =
       fun() ->
                  mnesia:read(table1, Participant, write), %% lock the record
                  Rec = mnesia:read(table1, Participant),
                  NewRec = Rec#table1{amount = amount + Amount},
                  mnesia:write(table1, NewRec, write)
        end,
        mnesia:activity(transaction, TFun,[], mnesia_frag).

rpc_to_remote_nodes(Module, Function, Parameters, []) ->
    {error, {cant_execute, Module, Function, Parameters}};
rpc_to_remote_nodes(Module, Function, Parameters, [Node|Nodes]) ->
    case rpc:call(Node, Module, Function, Parameters) of
        {badrpc, Rsn} ->
            io:format("badrpc : ~p~n",[Rsn]),
            rpc_to_remote_nodes(Module, Function, Parameters, Nodes);
        Res -> {ok, Res}
    end.

get_remote_nodes() ->
    case node() of
      Node when Node =:= 'n1' orelse
                Node =:= 'n2' -> ['n3', 'n4'];
      _ -> ['n1', 'n2']
    end.

@amitabhadatta14
Copy link
Author

you need to run the above code from node1 and node3 at the same time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug team:PS Assigned to OTP team PS
Projects
None yet
Development

No branches or pull requests

4 participants