Error loading tables into OmniSci in knn_model.py #13
Are your table name and dataframe name both "m"? |
Yes. Is that an issue?
|
Please share your script with me by email. |
conn.load_table("mrg",m,create='infer',method='arrow'). Is this the line causing error? Could you try to print "m" by using and share the output: print(m.head()) |
Hi Devika,
Yes, it appears that is the line causing the error. Here is the printed output:
>> m.head(5)
dpost rpost neighbor_id
0 0.0 0.0 AK-630667
1 0.0 0.0 AK-701587
2 0.0 0.0 AK-656813
3 0.0 0.0 AK-656812
4 0.0 0.0 AK-701520
|
Try this instead of "conn.load_table("voters",df,create='infer',method='arrow')":
conn.execute("Create table IF NOT EXISTS mrg (dpost FLOAT, rpost FLOAT, neighbor_id TEXT ENCODING NONE);")
conn.load_table_columnar("mrg", m, preserve_index=False) |
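For context, a self-contained sketch of that two-step load with pymapd; the connection parameters and the sample frame below are placeholders, not values from this issue:

import pandas as pd
from pymapd import connect

# placeholder credentials; substitute your own server details
conn = connect(user='admin', password='HyperInteractive',
               host='localhost', dbname='omnisci')
m = pd.DataFrame({'dpost': [0.0], 'rpost': [0.0], 'neighbor_id': ['AK-630667']})

# Create the table explicitly instead of relying on create='infer',
# then push the dataframe column by column.
conn.execute("CREATE TABLE IF NOT EXISTS mrg (dpost FLOAT, rpost FLOAT, neighbor_id TEXT ENCODING NONE);")
conn.load_table_columnar("mrg", m, preserve_index=False)

Creating the table up front avoids the type inference that the Arrow loader performs, which may be what failed here. |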
This seems to work. The current problem arises from reading in the knn output file (in this case it is knn_1000_CA1_2012.tar.gz). It seems to be timing out or hitting a memory limit?
>> df = pd.read_csv(filename, sep=',',dtype='unicode',index_col=None, low_memory='true',compression='gzip')
Killed
|
What GPU memory are you using? Please send the parameters of your job. |
Also, please test the entire script with a smaller file so that we know whether the problem is in the script or in memory. |
In testing this on the Rhode Island file, I am able to load the knn_1000_RI1_2012.tar.gz file, but it does not look like the data frame we expect:
>> df.head(5)
knn_1000_RI_2012.csv
0 PA-000007920358\tPA-10407918\td\tr\t40.2433976...
1 PA-000007920358\tPA-10408513\td\tr\t40.2433976...
2 PA-000007920358\tPA-000006487459\td\td\t40.243...
3 PA-000007920358\tPA-000006909098\td\td\t40.243...
4 PA-000007920358\tPA-000000307624\td\tr\t40.243...
>>
I think this means the sep is “\t” not “,”?
|
Yes, the separator is '\t', but in your script you specified ',', no? |
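For reference, a sketch of the read call adjusted for the tab separator; header=None is added on the assumption that the file has no header row, and low_memory takes a boolean rather than the string 'true':

df = pd.read_csv(filename, sep='\t', dtype='unicode', index_col=None,
                 low_memory=True, header=None, compression='gzip')
|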
So I can upload the file, but when I try to load it into OmniSci the process dies:
>> conn.execute("Create table IF NOT EXISTS knn (source_id TEXT ENCODING NONE, neighbor_id TEXT ENCODING NONE, source_pid TEXT ENCODING NONE, neighbor_pid TEXT ENCODING NONE, dist FLOAT);")
<pymapd.cursor.Cursor object at 0x2b96a3dd4048>
>> conn.load_table_columnar("knn", df,preserve_index=False)
Killed
|
This seems like a memory issue. Please send me the parameters you used to launch the job.
|
Please use a 256 GB RAM, 2 CPU, 1 GPU machine.
|
That is what I'm using, I believe.
|
Please recheck the parameters, and if it still fails with 256 GB, first check with FASRC help (by email) whether memory is the reason for the failure. If memory is the reason, then you will have to divide the file into smaller chunks to model it, because FASRC does not allow more than 256 GB on GPU. While dividing into smaller chunks, make sure you include all neighbors of a voter in the file. For example, if you take voter IDs 1 to 100, then the file should have all 1000 neighbors for voter IDs 1-100, or else the modelling will be corrupt.
|
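A rough sketch of that splitting step: stream the KNN output and cut only at voter boundaries, so each voter keeps all of its neighbors in one output file. This assumes the file is sorted by source_id (as the sample output suggests); the file names, column names, and chunk size are placeholders.

import pandas as pd

cols = ['source_id', 'neighbor_id', 'source_pid', 'neighbor_pid', 'dist']  # adjust to the real layout
reader = pd.read_csv('knn_1000_CA1_2012.csv', sep='\t', header=None,
                     names=cols, dtype='unicode', chunksize=1_000_000)

carry = None   # rows of the voter whose neighbors may span two chunks
part = 0
for chunk in reader:
    if carry is not None:
        chunk = pd.concat([carry, chunk], ignore_index=True)
    last_voter = chunk['source_id'].iloc[-1]
    carry = chunk[chunk['source_id'] == last_voter]      # possibly incomplete voter
    complete = chunk[chunk['source_id'] != last_voter]   # whole voters only
    complete.to_csv('knn_part_%d.csv' % part, sep='\t', header=False, index=False)
    part += 1
if carry is not None and not carry.empty:
    carry.to_csv('knn_part_%d.csv' % part, sep='\t', header=False, index=False)
|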
Also, I would suggest testing with the smallest input file (smaller than RI) you have on hand, so that we are sure the script is correct before we solve the memory scaling issue.
|
Okay. How do I go about dividing it? By creating smaller groups from the outset, when generating the knn output?
|
Yes, that is what you would have to do ultimately for bigger files. Please divide it into smaller groups and try again, but before that check with FASRC whether memory is indeed the issue even with 256 GB.
|
Okay, I am in the process of re-running it. I should also note that the overall FASRC session does not die, just the python3 session started by the knn_model.py script.
|
If the overall session does not die, then it might not be a GPU memory issue. Please run the python script in screen.
|
Yes, I am running it in screen.
|
Then your dataframe is running out of memory reading the whole file at once, since it's too big. Please read it in chunks; look into the chunksize option of the Pandas read_csv and modify the script:
pd.read_csv(filename, chunksize=chunksize)
|
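A sketch of that chunked pattern end to end, assuming the "knn" table was already created as above and that conn is an open pymapd connection; the file name, column names, and chunk size are placeholders:

import pandas as pd

cols = ['source_id', 'neighbor_id', 'source_pid', 'neighbor_pid', 'dist']  # adjust to the real layout
reader = pd.read_csv('knn_1000_CA1_2012.gz', sep='\t', dtype='unicode', header=None,
                     names=cols, compression='gzip', chunksize=5_000_000)
for chunk in reader:
    # each chunk is an ordinary DataFrame, so it loads the same way the full frame did
    conn.load_table_columnar("knn", chunk, preserve_index=False)
|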
Okay, thanks Devika. This might solve one issue, but also recall that last night the process died while reading one of the smaller tables (RI) into OmniSci, that is, after it had successfully loaded into the Python environment.
|
I think it is a memory issue. Please divide the file into smaller sizes and try again, and let's see what happens.
|
Hi Devika,
After looking at this more, one of the issues might have to do with how the file is being read into Python. When I read the tarred file directly into Python, there is a weird value at the intersection of the first row and first column. This does not occur if I first unzip the file and then load the .csv into Python. Why might this be happening? See below:
>> df = pd.read_csv('knn_1000_AK1_2012.tar.gz', sep='\t',dtype='unicode',index_col=None, low_memory='true',compression='gzip', header=None)
>> df.head()
0 1 2 3 4 5 6
0 n/holyscratch01/enos_lab/jbrown613/data/knn_10... AK-709502 i d 0 \N \N
1 AK-787334 AK-706032 i r 0 \N \N
2 AK-787334 AK-647339 i r 0 \N \N
3 AK-787334 AK-618324 i i 0 \N \N
4 AK-787334 DC-567085 i i 0 \N \N
Compared to this when reading in the unzipped file:
>> df = pd.read_csv('knn_1000_AK1_2012.csv', sep='\t',dtype='unicode',index_col=None, low_memory='true',header=None)
>> df.head()
0 1 2 3 4 5 6
0 AK-787334 AK-709502 i d 0 \N \N
1 AK-787334 AK-706032 i r 0 \N \N
2 AK-787334 AK-647339 i r 0 \N \N
3 AK-787334 AK-618324 i i 0 \N \N
4 AK-787334 DC-567085 i i 0 \N \N
|
You are reading a .tar.gz compressed file, but in your dataframe read_csv you are specifying .gz (gzip) compression. This is causing the problem. Could you look into how to read a .tar.gz compressed file into a dataframe?
|
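One way to read the tar member directly, a sketch using the standard-library tarfile module and assuming the archive holds a single CSV:

import tarfile
import pandas as pd

with tarfile.open('knn_1000_AK1_2012.tar.gz', 'r:gz') as tar:
    member = tar.getmembers()[0]           # assumes one CSV inside the archive
    with tar.extractfile(member) as f:
        df = pd.read_csv(f, sep='\t', dtype='unicode', header=None)

Note that a .tar.gz is a gzipped tar archive, not a gzipped CSV: pandas' gzip support decompresses it, but the tar header (which embeds the member's file name) then leads the stream, which is why the file name shows up in the first cell. Re-compressing the plain CSV with gzip alone would avoid this; merely renaming the .tar.gz to .gz would not. |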
Thanks, I'll look into this. Is one potential solution also zipping the file such that it only has the extension .gz?
|
Yes.
|
This actually did not appear to have solved the issue, as we still have the filename in the first row/column:
>> df = pd.read_csv('knn_1000_AK1_2012.gz', sep='\t', dtype='unicode', index_col=None, low_memory='true', compression='gzip', header=None)
>> df.head()
0 1 2 3 4 5 6
0 knn_1000_AK1_2012.csv AK-709502 i d 0 \N \N
1 AK-787334 AK-706032 i r 0 \N \N
2 AK-787334 AK-647339 i r 0 \N \N
3 AK-787334 AK-618324 i i 0 \N \N
4 AK-787334 DC-567085 i i 0 \N \N
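Renaming the archive does not change its contents: the file is still a tar archive inside the gzip stream, so pandas decompresses the raw tar data and the tar member header, which carries the member's path, gets parsed into the first cell. A sketch (filenames assumed from above) that sidesteps this by letting tarfile consume the header and handing the CSV member straight to read_csv:

import tarfile
import pandas as pd

# Open the archive, locate the single CSV member, and stream it into
# pandas; tarfile strips the member header so it never reaches the parser.
with tarfile.open('knn_1000_AK1_2012.tar.gz', 'r:gz') as tar:
    member = tar.getmembers()[0]  # assumed: one CSV per archive
    with tar.extractfile(member) as fh:
        df = pd.read_csv(fh, sep='\t', dtype='unicode',
                         index_col=None, header=None)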
… On Sep 15, 2020, at 1:02 PM, dkakkar ***@***.***> wrote:
You are reading a .tar.gz compressed file, but in your dataframe read_csv you are specifying .gz compression. This is causing the problem. Could you look into how to read .tar.gz compression into a dataframe?
|
Hi Devika,
You can disregard my last email; I am still troubleshooting some things and will give a full report in a few hours.
Thanks,
Jake
… On Sep 15, 2020, at 1:11 PM, dkakkar ***@***.***> wrote:
Yes.
|
Please share the file with me.
…On Tue, Sep 15, 2020, 1:33 PM Jacob Brown ***@***.***> wrote:
This actually did not appear to have solved the issue, as we still have the filename in the first row/column: …
|
Sure, take your time.
|
Hi Devika,
So I have figured out how to handle reading in the zipped files, and I have been able to read some of the smaller files into both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as I am getting errors related to grouping on string columns. You can see that output below:
>> conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);")
Traceback (most recent call last):
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
at_most_n=-1,
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
return self.recv_sql_execute()
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
return c.execute(operation, parameters=parameters)
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.
I also got an error that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:
>> conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
at_most_n=-1,
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
return self.recv_sql_execute()
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
return c.execute(operation, parameters=parameters)
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet
… On Sep 15, 2020, at 2:44 PM, dkakkar ***@***.***> wrote:
Sure, take your time.
|
What is the data type for source_id?
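One way to check from the pymapd side (a sketch using pymapd's get_table_details introspection call, assuming the knn table from your script):

# Sketch: list each column's name, type, and encoding for the knn table.
for col in conn.get_table_details('knn'):
    print(col)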
|
The data type for source_id is STR
… On Sep 15, 2020, at 3:53 PM, Devika Kakkar ***@***.***> wrote:
What is the data type for source_id?
|
Here is the code used to make the table:
conn.execute("Create table IF NOT EXISTS knn (source_id TEXT ENCODING NONE, neighbor_id TEXT ENCODING NONE, dist FLOAT, dpost FLOAT, rpost FLOAT);")
conn.load_table_columnar("knn", df,preserve_index=False)
… On Sep 15, 2020, at 3:53 PM, Devika Kakkar ***@***.***> wrote:
What is the data type for source_id?
On Tue, Sep 15, 2020 at 3:52 PM Jacob Brown ***@***.*** ***@***.***>> wrote:
Hi Devika,
So I have figured out how to handle reading in the zipped files, and I have been able to read in some of the smaller files to both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as am getting errors related to grouping on string columns. You can see that output below:
>>> conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);")
Traceback (most recent call last):
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
at_most_n=-1,
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
return self.recv_sql_execute()
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
return c.execute(operation, parameters=parameters)
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.
I also got an error that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:
>>> conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
at_most_n=-1,
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
return self.recv_sql_execute()
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
return c.execute(operation, parameters=parameters)
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet
|
Please use TEXT ENCODING DICT wherever you define it.
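For example, for the mrg table (the same pattern applies to any TEXT column you group or join on). This is a sketch only: the column names come from this thread, DICT(32) is just the default dictionary width, and the table is dropped first so the new encoding actually takes effect:

conn.execute("DROP TABLE IF EXISTS mrg;")
conn.execute("Create table mrg (dpost FLOAT, rpost FLOAT, neighbor_id TEXT ENCODING DICT(32));")
conn.load_table_columnar("mrg", m, preserve_index=False)

Dictionary-encoded strings are what OmniSci needs both for GROUP BY on string columns and for projecting string columns out of joins.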
|
Thanks Devika,
That seems to fix those issues. I think the remaining issues are the potential memory problem, which I can solve by outputting smaller files, and a problem when joining in SQL/OmniSci. I am running up against a unique constraint error that I do not understand. The rpost/dpost data frame that I am joining to the knn output will have multiple matches, since I am joining on neighbor_id and people sometimes share neighbors. There are no duplicates in the rpost/dpost data frame itself, as it contains one row for each registered voter (each potential neighbor, if you will). This kind of merge/join would not be a problem with the analogous functions in Python/R, but it seems to hit a join limitation in SQL. Can you clarify what is going on?
>> conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
at_most_n=-1,
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
return self.recv_sql_execute()
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
return c.execute(operation, parameters=parameters)
File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name
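One thing I notice: SELECT * would project neighbor_id from both tables, so the created table would have two columns with the same name — possibly that is what the catalog constraint (mapd_columns.tableid, mapd_columns.name) is complaining about, rather than duplicate rows. If so, something like this, with every output column projected explicitly under a unique name, might avoid it (column names per the lists below):

conn.execute("Create table temp as (SELECT knn.source_id, knn.neighbor_id, knn.dist, mrg.dpost, mrg.rpost FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")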
|
Please send me the column names for both tables.
|
knn: source_id, neighbor_id, dist
mrg: dpost, rpost, neighbor_id
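With those columns, and assuming the join goes through with explicit projections as sketched above, the aggregation step of the modeling code would then run against the joined table rather than knn directly — a sketch, where FROM temp is my assumption about where dpost/rpost live after the join:

conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM temp GROUP BY source_id);")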
|
Try:
Create table temp as (SELECT a.source_id, a.neighbor_id, a.dist, b.dpost, b.rpost FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id);
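For context, a minimal sketch of how that aliased join fits into the pymapd workflow used in this thread. The connection parameters below are illustrative placeholders (substitute your own host and credentials); the table names knn and mrg and the conn.execute pattern are the ones from the messages above.

from pymapd import connect

# Hypothetical connection details -- replace with your own server and login.
conn = connect(user='admin', password='HyperInteractive', host='localhost', dbname='omnisci')

# Listing the columns explicitly and aliasing the tables keeps only one
# neighbor_id column in the merged table.
conn.execute(
    "Create table temp as ("
    "SELECT a.source_id, a.neighbor_id, a.dist, b.dpost, b.rpost "
    "FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id);"
)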
|
Did it work?
|
Seems to work right now, yes. Thank you!
|
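For reference, a minimal sketch of the chunked read suggested above, assuming the tab-separated, headerless layout of the knn output files discussed later in this thread; the filename and chunk size are illustrative placeholders.

import pandas as pd

# Placeholder filename; assumes the archive has already been extracted to a
# plain CSV. The knn output files in this thread are tab-separated with no
# header row.
filename = 'knn_1000_CA1_2012.csv'

# chunksize makes read_csv return an iterator of DataFrames, so only about
# one million rows are held in memory at a time instead of the whole file.
for chunk in pd.read_csv(filename, sep='\t', dtype='unicode',
                         header=None, chunksize=1_000_000):
    print(len(chunk))  # replace with per-chunk processing or loading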
Just FYI, you were trying to select all columns from both tables previously, and in the final merged table you cannot have two columns with the same name (neighbor_id), so it was throwing a unique constraint error.
On Wed, Sep 16, 2020 at 1:20 PM Jacob Brown <[email protected]> wrote:
… Seems to work right now yes. Thank you!

On Sep 16, 2020, at 1:17 PM, dkakkar wrote:
Did it work?

On Tue, Sep 15, 2020 at 8:08 PM Devika Kakkar wrote:
Try:

Create table temp as (SELECT a.source_id, a.neighbor_id, a.dist, b.dpost, b.rpost FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id);
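A sketch of how that query might be issued end to end through pymapd; the connection parameters below are placeholders modeled on the connection string printed later in this thread, not the actual job settings.

from pymapd import connect

# Placeholder credentials, host, and port; adjust to the real server.
conn = connect(user='admin', password='HyperInteractive',
               host='localhost', port=9893, dbname='omnisci')

# Selecting explicit, table-qualified columns gives the merged table a single
# neighbor_id column, which avoids the duplicate-column-name failure that the
# SELECT * version triggers.
conn.execute(
    "Create table temp as ("
    "SELECT a.source_id, a.neighbor_id, a.dist, b.dpost, b.rpost "
    "FROM knn a LEFT JOIN mrg b ON a.neighbor_id = b.neighbor_id);"
)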
On Tue, Sep 15, 2020 at 8:03 PM Jacob Brown wrote:
Knn:
source_id
neighbor_id
dist

mrg:
dpost
rpost
neighbor_id

On Sep 15, 2020, at 7:37 PM, dkakkar wrote:
Pls send me column names for both tables.

On Tue, Sep 15, 2020, 7:22 PM Jacob Brown wrote:
Thanks Devika,

That seems to fix those issues. I think the remaining issue is the potential memory issue, which I can solve by outputting smaller files, and an issue when joining in sql/Omnisci. I am running up against a unique constraint error that I do not understand. The rpost/dpost data frame that I am joining to the knn output will have multiple matches, since I am joining it to neighbor_id, and sometimes people share neighbors. There are no duplicates in the rpost/dpost data frame, as it contains one row for each registered voter (or each potential neighbor, if you will). This kind of merge/join would not be a problem using similar functions in python/R, but seems to run up against a join difficulty in sql. Can you clarify what is going on?

>>> conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Sqlite3 Error: UNIQUE constraint failed: mapd_columns.tableid, mapd_columns.name

On Sep 15, 2020, at 3:57 PM, dkakkar wrote:
Please use TEXT ENCODING DICT wherever you define it.

On Tue, Sep 15, 2020 at 3:55 PM Jacob Brown wrote:
The data type for source_id is STR

On Sep 15, 2020, at 3:53 PM, Devika Kakkar wrote:
What is the data type for source_id?
On Tue, Sep 15, 2020 at 3:52 PM Jacob Brown wrote:
Hi Devika,

So I have figured out how to handle reading in the zipped files, and I have been able to read some of the smaller files into both Python and OmniSci. The issues I am running into now involve running the modeling code you provided, as I am getting errors related to grouping on string columns. You can see that output below:

>>> conn.execute("Create table results as (SELECT source_id, AVG(dpost) as mean_d_post, AVG(rpost) as mean_r_post, SUM(dpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_d_post, SUM(rpost * 1/(1+dist))/SUM(1/(1+dist)) as wtd_r_post FROM knn GROUP BY source_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Cannot group by string columns which are not dictionary encoded.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Cannot group by string columns which are not dictionary encoded.

I also got an error that I could not join tables using TEXT type variables in OmniSci. This occurred when I was trying to merge in the new rpost and dpost values:

>>> conn.execute("Create table temp as (SELECT * FROM knn LEFT JOIN mrg ON knn.neighbor_id = mrg.neighbor_id);")
Traceback (most recent call last):
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 118, in execute
    at_most_n=-1,
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1755, in sql_execute
    return self.recv_sql_execute()
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/omnisci/thrift/OmniSci.py", line 1784, in recv_sql_execute
    raise result.e
omnisci.thrift.ttypes.TOmniSciException: TOmniSciException(error_msg='Exception: Projection type TEXT not supported for outer joins yet')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 390, in execute
    return c.execute(operation, parameters=parameters)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/cursor.py", line 121, in execute
    raise _translate_exception(e) from e
pymapd.exceptions.Error: Exception: Projection type TEXT not supported for outer joins yet
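These two errors are what the "TEXT ENCODING DICT" advice above addresses; a hedged sketch of what that table definition could look like, with column types inferred from the column lists earlier in the thread (adjust to the real schema):

# Sketch only: recreate knn with dictionary-encoded TEXT columns so that
# GROUP BY source_id is permitted.
conn.execute("DROP TABLE IF EXISTS knn;")
conn.execute(
    "CREATE TABLE knn ("
    "source_id TEXT ENCODING DICT, "
    "neighbor_id TEXT ENCODING DICT, "
    "dist FLOAT);"
)
# Dictionary-encoded strings may also sidestep the 'Projection type TEXT not
# supported for outer joins yet' error, since the join no longer projects a
# none-encoded TEXT column.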
On Sep 15, 2020, at 2:44 PM, dkakkar wrote:
Sure, take your time.

On Tue, Sep 15, 2020 at 1:51 PM Jacob Brown wrote:
Hi Devika,

You can disregard my last email, I am still troubleshooting some things. I'll give a full report in a few hours.

Thanks,

Jake

On Sep 15, 2020, at 1:11 PM, dkakkar wrote:
Yes.

On Tue, Sep 15, 2020 at 1:09 PM Jacob Brown wrote:
Thanks, I'll look into this. Is one potential solution also zipping the file such that it only has the extension .gz?

On Sep 15, 2020, at 1:02 PM, dkakkar wrote:
You are reading a .tar.gz compressed file, but in your dataframe read CSV you are specifying .gz compression. This is causing the problem. Could you look into how to read .tar.gz compression into a dataframe?
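A .tar.gz is a gzipped tar stream, not a gzipped CSV, which is also why a stray path value shows up in the first cell of the output quoted below: the tar member header leaks into the data when the file is gunzipped directly. One possible way to read the CSV member straight out of the archive, assuming a single file inside (an untested sketch):

import tarfile
import pandas as pd

# Open the archive as a tar stream rather than as a gzipped CSV.
with tarfile.open('knn_1000_AK1_2012.tar.gz', 'r:gz') as tar:
    member = tar.getmembers()[0]  # assumes exactly one CSV in the archive
    df = pd.read_csv(tar.extractfile(member), sep='\t',
                     dtype='unicode', index_col=None, header=None)

print(df.head())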
On Tue, Sep 15, 2020 at 12:57 PM Jacob Brown wrote:
Hi Devika,

After looking at this more, one of the issues might have to do with how it is being read into Python. When I read the tarred file directly into python, there is a weird value at the intersection of the first row and first column. This does not occur if I first unzip the file and then load the .csv into Python. Why might this be happening? See below:

>>> df = pd.read_csv('knn_1000_AK1_2012.tar.gz', sep='\t', dtype='unicode', index_col=None, low_memory='true', compression='gzip', header=None)
>>> df.head()
                                                   0          1  2  3  4   5   6
0  n/holyscratch01/enos_lab/jbrown613/data/knn_10...  AK-709502  i  d  0  \N  \N
1                                          AK-787334  AK-706032  i  r  0  \N  \N
2                                          AK-787334  AK-647339  i  r  0  \N  \N
3                                          AK-787334  AK-618324  i  i  0  \N  \N
4                                          AK-787334  DC-567085  i  i  0  \N  \N

Compared to this when reading in the unzipped file:

>>> df = pd.read_csv('knn_1000_AK1_2012.csv', sep='\t', dtype='unicode', index_col=None, low_memory='true', header=None)
>>> df.head()
           0          1  2  3  4   5   6
0  AK-787334  AK-709502  i  d  0  \N  \N
1  AK-787334  AK-706032  i  r  0  \N  \N
2  AK-787334  AK-647339  i  r  0  \N  \N
3  AK-787334  AK-618324  i  i  0  \N  \N
4  AK-787334  DC-567085  i  i  0  \N  \N

On Sep 15, 2020, at 11:14 AM, dkakkar wrote:
I think it is a memory issue. Please divide the file into smaller pieces, try again, and let's see what happens.

On Tue, Sep 15, 2020 at 11:11 AM Jacob Brown wrote:
Okay, thanks Devika. This might solve one issue, but also recall that last night the process died while reading one of the smaller tables (RI) into OmniSci, that is, after successfully loading it into the Python environment.

On Sep 15, 2020, at 11:09 AM, dkakkar wrote:
Then your dataframe is running out of memory to read the whole file at once since it's too big. Please read it in chunks, look into chunksize option while using Pandas dataframe to modify the script:

pd.read_csv(filename, chunksize=chunksize)
|
While running the modified knn_model.py script I got the following error. It appears to be related to converting the merged table to OmniSci, but I do not know what the error message "Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array" means or how to fix it:
(omnisci) [jbrown613@holygpu2c0705 neighbors]$ time python3 ~/sql/knn_model_merge.py
Connecting to Omnisci
Connected Connection(omnisci://admin:***@localhost:9893/omnisci?protocol=binary)
Traceback (most recent call last):
  File "/n/home09/jbrown613/sql/knn_model_merge.py", line 37, in <module>
    conn.load_table("m", m, create='infer', method='arrow')
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 687, in load_table
    return self.load_table_arrow(table_name, data)
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/connection.py", line 835, in load_table_arrow
    data, metadata, preserve_index=preserve_index
  File "/n/home09/jbrown613/.conda/envs/omnisci/lib/python3.6/site-packages/pymapd/_pandas_loaders.py", line 248, in serialize_arrow_payload
    data = pa.RecordBatch.from_pandas(data, preserve_index=preserve_index)
  File "pyarrow/table.pxi", line 704, in pyarrow.lib.RecordBatch.from_pandas
  File "pyarrow/table.pxi", line 749, in pyarrow.lib.RecordBatch.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
real 5m54.114s
user 5m11.406s
sys 0m25.939s
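pa.RecordBatch.from_pandas requires each column to convert to a single contiguous pyarrow Array, and very large string columns typically come back as a ChunkedArray instead, which is the likely source of this error. A hedged workaround sketch, bypassing the arrow loader in favor of an explicit table definition plus pymapd's columnar loader; the schema here is illustrative and must be matched to the real dataframe:

# Sketch of a possible workaround, not verified on this data: create the
# table up front and load the dataframe column by column instead of going
# through the arrow path.
conn.execute(
    "CREATE TABLE IF NOT EXISTS m "
    "(dpost FLOAT, rpost FLOAT, neighbor_id TEXT ENCODING DICT);"
)
conn.load_table_columnar("m", m, preserve_index=False)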