Conversation
|
Can anybody help add reviewers? (Also approve running workflows) |
|
PTAL @alainfrisch @Drup @pmetzger |
src/syntax/ppx_sedlex.ml
Outdated
| with Not_found -> | ||
| err p.ppat_loc (Printf.sprintf "unbound regexp %s" x) | ||
| end | ||
| | Ppat_alias (p, ({ txt = x } as x_loc)) when allow_alias -> |
There was a problem hiding this comment.
Can we give a better error for Ppat_alias _ when not allow_alias ?
There was a problem hiding this comment.
This has now been addressed.
src/syntax/ppx_sedlex.ml
Outdated
| let tables = List.map table (get_tables ()) in | ||
| let funcs = | ||
| let loc = default_loc in | ||
| [%str let __sedlex_tl = function _ :: tl -> tl | _ -> assert false] |
There was a problem hiding this comment.
Can we put this in the lib, similar to what we do for Sedlexing.__private__next_int
There was a problem hiding this comment.
This comment is not relevant anymore
src/syntax/ppx_sedlex.ml
Outdated
| let brs = | ||
| Array.of_list | ||
| (List.map | ||
| (fun ((r, s), e) -> ((r, List.of_seq (StrLocSet.to_seq s)), e)) |
toots
left a comment
There was a problem hiding this comment.
I can't really comment on the implementation but, for the API, is there a benefit in exporting offset/length vs simply binding the resulting match string?
|
Could you give more details about the current implementation ? Why is it done this way ? Can't we update positions as we traverse the automation, instead of doing a second pass based on some trace ? Why is the trace processed in reverse order ? |
|
There is a lot of room for optimization. Offset tracking could be shared between bindings. For example, in Some offset can be computer statically. In the regexp above, start of a is 0, end of b is the end of the lexeme. Have you considered such optimizations ? |
Returns an unexpected result: |
|
I think I would be easier if your trace was maintaining pairs of pos, instead of pos + len. |
src/syntax/ppx_sedlex.ml
Outdated
| | [] -> e | ||
| | aliases -> | ||
| let loc = default_loc in | ||
| let _, action_offsets = Option.get offsets.(i) in |
There was a problem hiding this comment.
instead of matching on aliases above, we could directly match on offsets.(i)
let gen_aliases lexbuf offsets i e aliases =
match offsets.(i) with
| None -> e
| Some (_, action_offsets) ->
let loc = default_loc in
pexp_let ~loc Nonrecursive
[
value_binding ~loc
~pat:[%pat? __sedlex_offsets]
~expr:
(appfun (trace_fun i) [evar ~loc lexbuf; [%expr __sedlex_path]]);
]
@@ pexp_let ~loc Nonrecursive
(List.map
(fun { txt = alias; loc } ->
let start = Hashtbl.find action_offsets (alias, true) in
let stop = Hashtbl.find action_offsets (alias, false) in
value_binding ~loc ~pat:(pvar ~loc alias)
~expr:
[%expr
__sedlex_offsets.([%e eint ~loc start]),
__sedlex_offsets.([%e eint ~loc stop])
- __sedlex_offsets.([%e eint ~loc start])])
aliases)
@@ e| with Not_found -> Hashtbl.add action2cases action [i]) | ||
| actions) | ||
| cases; | ||
| let counter = ref 0 in |
There was a problem hiding this comment.
counter could be replace by Hastbl.length alias2offset I think
src/syntax/ppx_sedlex.ml
Outdated
| let offset_array = | ||
| value_binding ~loc | ||
| ~pat:[%pat? __sedlex_offsets] | ||
| ~expr:(pexp_array ~loc (List.init offsets_num (fun _ -> [%expr 0]))) |
There was a problem hiding this comment.
I would use an invalid position here (e.g. '-1`) as it would make it more obvious if we ever encounter a bug where position are not updated.
src/syntax/sedlex.ml
Outdated
| let node_j = Hashtbl.find nodes_idx from_node in | ||
| let rec dfs cset actions to_node = | ||
| let node_i = Hashtbl.find nodes_idx to_node in | ||
| try ignore (Hashtbl.find cases (i, node_i, j, cset)) |
src/syntax/sedlex.ml
Outdated
|
|
||
| let compile_traces states (start, final) = | ||
| let counter = ref 0 in | ||
| let nodes_idx = Hashtbl.create 31 in |
There was a problem hiding this comment.
How are nodes_idx different from node.id ?
There was a problem hiding this comment.
Right. I forgot there's already an id in node.
| let i = Hashtbl.find nodes_idx node in | ||
| let actions = append_action actions node.action in | ||
| let cases = | ||
| if node.trans <> [] || node == final then |
There was a problem hiding this comment.
Can you add a comment explaining these conditions ?
| let r1, s1 = aux allow_alias p1 in | ||
| let r2, s2 = aux allow_alias p2 in | ||
| if not (StrLocSet.equal s1 s2) then begin | ||
| let x = |
There was a problem hiding this comment.
What about providing the full list of problematic bindings ?
(e.g StrLocSet.(diff (union s1 s2) (inter s1 s2)))
| (fun (r1, s1) p -> | ||
| let r2, s2 = aux allow_alias p in | ||
| if not (StrLocSet.disjoint s1 s2) then begin | ||
| let x = StrLocSet.choose (StrLocSet.inter s1 s2) in |
There was a problem hiding this comment.
Same here, we could return the full list of duplicated bindings
There was a problem hiding this comment.
I think it's OK to report one location at a time, just like the OCaml compiler behaves. (One name corresponds to one location)
src/syntax/sedlex.ml
Outdated
| (fun to_state i -> | ||
| List.iter | ||
| (fun from_node -> | ||
| try |
There was a problem hiding this comment.
is this try-with necessary ?
There was a problem hiding this comment.
It caused some problem when I removed nodes_idx. Now it's fixed.
There was a problem hiding this comment.
Can you elaborate on the issue and explain what are the irrelevant nodes ?
| [ | ||
| pint ~loc curr_state; | ||
| pint ~loc curr_node; | ||
| pint ~loc prev_state; |
There was a problem hiding this comment.
is prev_state ever needed to choose the action ?
On a trivial example
let rec token buf =
match%sedlex buf with
| (Star "a" as a), (Plus "b" as b), "c" ->
print_alias buf "a" a;
print_alias buf "b" b;
token buf
| eof -> print_endline "EOF"
| _ -> failwith "Unexpected character"curr_state and curr_node are enough for choosing the action.
The following
let __sedlex_code = Sedlexing.lexeme_code buf (__sedlex_pos - 1) in
(match (__sedlex_curr_state, __sedlex_curr_node,
__sedlex_prev_state)
with
| (2, 6, 0) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest)
| (0, 6, 2) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest)
| (3, 6, 2) -> __sedlex_aux (__sedlex_pos - 1) 2 6 __sedlex_rest
| (4, 1, 3) -> __sedlex_aux (__sedlex_pos - 1) 3 2 __sedlex_rest
| (3, 2, 0) ->
(__sedlex_offsets.(1) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 0 6 __sedlex_rest)
| (3, 6, 0) -> __sedlex_aux (__sedlex_pos - 1) 0 6 __sedlex_rest
| (2, 6, 2) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest)
| (3, 2, 2) ->
(__sedlex_offsets.(1) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 2 6 __sedlex_rest)
| (3, 2, 3) ->
(__sedlex_offsets.(1) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 3 6 __sedlex_rest)
| (3, 6, 3) -> __sedlex_aux (__sedlex_pos - 1) 3 6 __sedlex_rest
| (0, 10, 2) -> __sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest
| (0, 6, 0) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest)
| (2, 10, 2) -> __sedlex_aux (__sedlex_pos - 1) 2 10 __sedlex_rest
| (0, 10, 0) -> __sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest
| (2, 10, 0) -> __sedlex_aux (__sedlex_pos - 1) 0 10 __sedlex_rest
| _ -> assert false) incan be rewritten into
let __sedlex_code = Sedlexing.lexeme_code buf (__sedlex_pos - 1) in
(match (__sedlex_curr_state, __sedlex_curr_node) with
| (2, 6) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
__sedlex_rest)
| (0, 6) ->
(__sedlex_offsets.(2) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
__sedlex_rest)
| (3, 6) ->
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 6
__sedlex_rest
| (4, 1) ->
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 2
__sedlex_rest
| (3, 2) ->
(__sedlex_offsets.(1) <- __sedlex_pos;
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 6
__sedlex_rest)
| (0, 10) ->
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
__sedlex_rest
| (2, 10) ->
__sedlex_aux (__sedlex_pos - 1) __sedlex_prev_state 10
__sedlex_rest
| _ -> assert false) inif prev_state is sometime useful, we could include it inside a guard only when needed.
There was a problem hiding this comment.
The prev_state is still useful. It's a good idea to make it a guard.
Let me make more tests.
|
FYI, if one really wants to support capture groups correctly: https://re2c.org/#papers has several must-read papers. |
|
@pmetzger Thanks for the link! |
| let r2, s2 = aux allow_alias p2 in | ||
| if not (StrLocSet.equal s1 s2) then begin | ||
| let x = | ||
| try StrLocSet.choose (StrLocSet.diff s1 s2) |
There was a problem hiding this comment.
I would avoid the exception based selection. and use
StrLocSet.choose StrLocSet.(diff (union s1 s2) (inter s1 s2))
I've added this in hhugo@0284bfd |
|
Hey there! Do we want to try & come to a consensus on wether this could be merged? |
|
@toots I think there's at least one code review comment above that hasn't yet been addressed? Generally I think we should merge though? |
|
I currently can't close/resolve threads. There are still a bunch unaddressed ones. |
|
replaced by #175 |
This PR tries to support named capture groups. Closes #5, closes #102.
Usages
In the example
Variable
digitsis defined as a(pos, len)pair, which can be passed as arguments toSedlexing.sub_lexemeto retrieve the group captured bydigits.The
assyntax is only allowed inmatch%sedlex ... with ...constructs. You cannot provide an alias inlet ... = [%sedlex.regexp? ...]. And it is not allowed inside constructors (i.e.Star,Plus,Rep,Opt,Compl,Sub,Intersect,Chars).Implementation
We are trying to restore the NFA node transitions path with the information of the DFA state transitions path.
Explanation
And finally,
is translated to
Note
When there are multiple available paths, any of them might be chosen. As in
One of the possible combinations of
aandbwill be returned.Others
There are some tests (
test/number_lexer.ml,test/misc.ml) currently. More to be added.