add a job to select suspected doublet and screen / API to solve #274

Open
jluc2808 opened this issue Mar 6, 2016 · 2 comments

Comments

@jluc2808
Member

jluc2808 commented Mar 6, 2016

I always find duplicate persons (doublets) in the person database, and even with manual cleanup I regularly resolve around 20 duplicates per week.
I understand that duplicates are created for many reasons.
After a long search, I have found a way to detect suspected duplicates: the only field that gives a reliable signal is the name field in the person database, which aggregates last name and first name (unlike the identifier, which is unique by construction).
This would not solve all cases, but on my database (26,000 persons) it yields around 95% of the suspected duplicates.

For example, take these 2 entries in the person database:

'28964', '2016-01-15 07:45:07', '3', '2016-03-05 19:00:45', 'Stéphanie Pillonca est une réalisatrice et une actrice française.', NULL, 'Stéphanie Pillonca-Kervern', NULL, NULL, NULL, 'DONE', 'Stéphanie', 'Stephanie Pillonca-Kervern', 'Pillonca-Kervern', '2016-01-15 07:48:54', 'Stéphanie Pillonca-Kervern', '0', 'DONE', NULL

'30669', '2016-03-05 19:00:46', '2', '2016-03-05 19:05:05', 'Stéphanie Pillonca est une réalisatrice et une actrice française.', NULL, 'Stéphanie Pillonca-Kervern', NULL, NULL, NULL, 'DONE', 'Stéphanie', 'Stephanie Pillonca', 'Pillonca-Kervern', '2016-03-05 19:04:14', 'Stéphanie Pillonca-Kervern', '0', 'DONE', NULL

The identifier fields are, respectively, 'Stephanie Pillonca-Kervern' and 'Stephanie Pillonca'.
The name field is the same in both entries: 'Stéphanie Pillonca-Kervern'.

So my proposal is to add a task that builds a list of suspected duplicates, plus an API or screen that lets the end user resolve them by applying the doublet API to the suspected entries.
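
To make the idea concrete, here is a minimal sketch of the detection step: group persons by the name field and report every name shared by more than one row. The row layout (`id`, `identifier`, `name`) and the function name `suspected_doublets` are assumptions for illustration only, not the project's actual schema or API.

```python
from collections import defaultdict

# Hypothetical, simplified person rows: only the columns needed here.
persons = [
    {"id": 28964, "identifier": "Stephanie Pillonca-Kervern",
     "name": "Stéphanie Pillonca-Kervern"},
    {"id": 30669, "identifier": "Stephanie Pillonca",
     "name": "Stéphanie Pillonca-Kervern"},
    {"id": 1, "identifier": "Some Other Person",
     "name": "Some Other Person"},
]


def suspected_doublets(rows):
    """Group rows by name and keep only names used by more than one person."""
    ids_by_name = defaultdict(list)
    for row in rows:
        ids_by_name[row["name"]].append(row["id"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


print(suspected_doublets(persons))
```

An exact match on name only flags candidates; two different people can share a name, so each pair would still need to be confirmed by the user.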

@modmax
Member

modmax commented Mar 8, 2016

First, the cause of this error:
1.) Movie A is scanned with allocine, which retrieves "Stephanie Pillonca" as an actor and stores it with identifier "Stephanie Pillonca"
2.) Movie B is scanned with another scanner, which retrieves the person as "Stephanie Pillonca-Kervern" and stores it with identifier "Stephanie Pillonca-Kervern"
3.) Later on, the allocine person scanner scans both persons and sets the correct name in both entries; so you have 2 persons with different identifiers but the same name ...

The problem is that there is no unique ID for each person that is shared by every application; also, the name is often not the same for the same person, e.g. 50 Cent, Curtis Jackson, Curtis "50 Cent" Jackson and so on ...

Doublet detection alone will not work; first there must be a mechanism for storing duplicates, perhaps a separate table that maps "doublet identifier" to "correct identifier", so that later scans can use this information and find the correct person.
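
As a rough sketch of how such a mapping could be consulted during a scan, assuming the table is loaded into a plain dictionary (`doublet_map` and `resolve_identifier` are hypothetical names, not existing code):

```python
# Hypothetical mapping table: doublet identifier -> correct identifier.
doublet_map = {"Stephanie Pillonca": "Stephanie Pillonca-Kervern"}


def resolve_identifier(identifier, mapping):
    """Follow the mapping until an identifier without an entry is reached.

    The 'seen' set stops the loop if the table accidentally contains a cycle.
    """
    seen = set()
    while identifier in mapping and identifier not in seen:
        seen.add(identifier)
        identifier = mapping[identifier]
    return identifier


# A scanner would call this before looking up or creating a person.
print(resolve_identifier("Stephanie Pillonca", doublet_map))
```

Identifiers with no entry in the table pass through unchanged, so persons that were never marked as doublets are unaffected.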

Furthermore: if person A is marked as a doublet of person B, then the associations of videos must be adjusted ... but I think that needs a rework of the current handling
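
A minimal sketch of what adjusting the associations could mean, assuming each association is a `(video_id, person_id, job)` tuple; this layout and the name `merge_person` are illustrative assumptions, not the current data model:

```python
# Hypothetical association rows: (video_id, person_id, job).
cast = {
    (10, 30669, "ACTOR"),
    (10, 28964, "ACTOR"),
    (11, 28964, "ACTOR"),
}


def merge_person(associations, doublet_id, correct_id):
    """Return the associations with doublet_id replaced by correct_id.

    Building a set drops rows that become identical after the replacement,
    e.g. when both persons were already attached to the same video.
    """
    return {
        (video, correct_id if person == doublet_id else person, job)
        for video, person, job in associations
    }


print(sorted(merge_person(cast, doublet_id=30669, correct_id=28964)))
```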

@jluc2808
Member Author

jluc2808 commented Mar 8, 2016

As you describe, we cannot resolve duplicates while scanning, so the suggested doublet table is probably the best way to handle duplicates that have already been identified.
But that table does not help with the duplicates that have not been identified yet.

I just ran a little script to find duplicates in my database by comparing names, and it found 500 duplicates.
So we need a mechanism that scans for duplicates and asks the user to acknowledge them.
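
The acknowledgement step could look roughly like the sketch below: each suspected pair is shown to the user, and only confirmed pairs end up in the mapping. The function `review` and the `confirm` callback are hypothetical; in practice the callback would be backed by a screen or an API call.

```python
def review(candidates, confirm):
    """Build a doublet -> correct mapping from user-confirmed pairs.

    candidates: iterable of (doublet_identifier, correct_identifier) pairs.
    confirm: callable returning True when the user accepts a pair.
    """
    accepted = {}
    for doublet, correct in candidates:
        if confirm(doublet, correct):
            accepted[doublet] = correct
    return accepted


# Stand-in for a real user decision: accept only the first pair.
candidates = [
    ("Stephanie Pillonca", "Stephanie Pillonca-Kervern"),
    ("John Smith", "John Smith Jr."),
]
print(review(candidates, lambda doublet, correct: doublet.startswith("Stephanie")))
```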


2 participants